├── .gitignore
├── INSTRUCTIONS.md
├── README.md
├── Step00
│   ├── README.md
│   └── Slide0_Notebook.ipynb
├── Step01
│   ├── .gitkeep
│   └── README00.md
├── Step02
│   ├── README01.md
│   ├── make_venv.sh
│   └── requirements.txt
├── Step03
│   ├── README02.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step04
│   ├── README03.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step05
│   ├── README04.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step06
│   ├── README05.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step07
│   ├── README06.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step08
│   ├── README07.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step09
│   ├── README08.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step10
│   ├── README09.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step11
│   ├── README10.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step12
│   ├── README11.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step13
│   ├── README12.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step14
│   ├── README13.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step15
│   ├── README14.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step16
│   ├── README15.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step17
│   ├── README16.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step18
│   ├── README17.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step19
│   ├── README18.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step20
│   ├── README19.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step21
│   ├── README20.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step22
│   ├── README21.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── create_branch.sh
├── create_instructions.sh
├── create_sqlite.py
├── data
│   └── titanic.db
├── make_venv.sh
├── requirements.txt
└── test_all.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | .vscode/
3 | .idea/
4 | .ipynb_checkpoints/
5 | __pycache__
6 | __pycache__/*
7 | data/*.pkl
8 |
--------------------------------------------------------------------------------
/INSTRUCTIONS.md:
--------------------------------------------------------------------------------
1 | ### Step 01: Project setup
2 |
3 | - Write script to create virtual environment
4 | - Write the first `requirements.txt`
5 |
6 | You can select a different `setuptools` version or pin the package versions.
7 | ### Step 02: Code setup
8 |
9 | - Write python script stub with `typer`
10 | - Write shell script to execute the python script
11 |
12 | `Typer` is an amazing tool that turns any python function into a command line interface. Here we use it for future-proofing because at the moment there are no CLI arguments.
13 |
14 | The program will be defined in a class that is instantiated by the `main()` function, which then calls its main `run()` entry point. The `main()` function will be called by `typer` to pass in any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`). More on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london).
15 | ### Step 03: Move code out of the notebook
16 |
17 | - Copy-paste everything into the `run()` function
18 |
19 | First step is to get started. There will be plenty of steps to structure the code better.
20 | ### Step 04: Move over the tests
21 |
22 | - Copy-paste tests and testing code from the notebook in `Step00` into the `run()` function.
23 |
24 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the value of some variables, and the next time you run the code it will compare them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
25 | ### Step 05: Decouple from the database
26 |
27 | - Write `SQLLoader` class
28 | - Move database related code into it
29 | - Replace database calls with interface calls in `run()`
30 |
31 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
32 | ### Step 06: Decouple from the database
33 |
34 | - Create loader property and argument in `TitanicModelCreator.__init__()`
35 | - Remove the database loader instantiation from the `run()` function
36 | - Update `TitanicModelCreator` construction to create the loader there
37 |
38 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing to build a test context for rapid iteration. After you created the adapter class, this step does the actual decoupling. This is an example of "Dependency Injection", when a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction. As the saying goes: "Complex behaviour is constructed not written." The related Dependency Inversion Principle is the `D` in the famed `SOLID` principles, and arguably the most important.
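
A minimal sketch of the change, assuming the `SqlLoader` from the previous step (the argument name is a free choice):

```python
class TitanicModelCreator:
    def __init__(self, loader):
        self.loader = loader  # injected: anything with get_passengers()/get_targets()
        np.random.seed(42)

    def run(self):
        df = self.loader.get_passengers()
        targets = self.loader.get_targets()
        # ... rest of the pipeline unchanged ...


def main(param: str = 'pass'):
    # Construction happens here, not inside run()
    titanic_model_creator = TitanicModelCreator(
        loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
    )
    titanic_model_creator.run()
```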
39 | ### Step 07: Write testing dataloader
40 |
41 | - Write a class that loads the required data from files
42 | - Same interface as `SqlLoader`
43 | - Add a "real" loader to it as a property
44 |
45 | This will allow the test context to work without a DB connection and still have the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable as they have the same interface.
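
A possible sketch, assuming pickled dataframes as the cache format (the file locations are an assumption):

```python
import os
import pandas as pd


class TestLoader:
    def __init__(self, passengers_filename, targets_filename, real_loader):
        self.passengers_filename = passengers_filename
        self.targets_filename = targets_filename
        self.real_loader = real_loader  # the "real" SqlLoader as fallback

    def get_passengers(self):
        # First run: fetch through the real loader and cache to file
        if not os.path.isfile(self.passengers_filename):
            self.real_loader.get_passengers().to_pickle(self.passengers_filename)
        return pd.read_pickle(self.passengers_filename)

    def get_targets(self):
        if not os.path.isfile(self.targets_filename):
            self.real_loader.get_targets().to_pickle(self.targets_filename)
        return pd.read_pickle(self.targets_filename)
```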
46 | ### Step 08: Write the test context
47 |
48 | - Create `test_main` function
49 | - Make sure `typer` calls that in `typer.run()`
50 | - Copy the code from `main` to `test_main`
51 | - Replace `SqlLoader` in it with `TestLoader`
52 |
53 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. Also, if it is still not fast enough, an additional parameter can reduce the amount of data in the test to make the process faster. [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo]
54 |
55 | This is the essence of the importance of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places, there is no time spent on translating from one to another. The test setup should reflect the production context as closely as possible, so when a test fails or passes you can expect the same to happen in production as well. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" when you are convinced it is doing what you think it should do. But it is the same code, so deployment is effortless.
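
A sketch of the test context (the cache filenames are assumptions):

```python
def test_main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=TestLoader(
            passengers_filename='../data/passengers.pkl',
            targets_filename='../data/targets.pkl',
            real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
        )
    )
    titanic_model_creator.run()


if __name__ == "__main__":
    typer.run(test_main)  # test_main is exercised from now on; main stays for "production"
```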
56 | ### Step 09: Merge passenger data with targets
57 |
58 | - Remove the `get_targets()` interface
59 | - Replace the query in `SqlLoader`
60 | - Remove any code related to `targets`
61 |
62 | This is a step to prepare to build the "domain data model". The Titanic model is about the survival of her passengers. For the code to align with this domain, the concept of "passengers" needs to be introduced (as a class/object). A passenger either survived or not; it is an attribute of the passenger and needs to be implemented like that. The replacement query is sketched after the links below.
63 |
64 | This is a critical part of the code quality journey and building better systems. Once you introduce these concepts your code will depend directly on the business problem you are solving, not the various representations in which the data is stored (pandas, numpy, csv, etc.). I wrote about this many times on my blog:
65 |
66 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data)
67 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
68 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses)
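
The replacement query in `SqlLoader` could look like this (table and column names as shown in the notebook's schema listing):

```python
class SqlLoader:
    # ... __init__ as before ...

    def get_passengers(self):
        # Survival becomes an attribute of the passenger row; get_targets() is gone
        query = '''
            SELECT tbl_passengers.*, tbl_targets.is_survived
            FROM tbl_passengers
            JOIN tbl_targets ON tbl_passengers.pid = tbl_targets.pid
        '''
        return pd.read_sql(query, con=self.connection)
```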
69 | ### Step 10: Create Passenger class
70 |
71 | - Import `BaseModel` from `pydantic`
72 | - Create the class by inspecting:
73 | - The `dtype` of columns used in `df`
74 | - The actual values in `df`
75 | - The names of the columns that are used later in the code
76 |
77 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you need to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic`'s validators. One goal of the refactoring is to make sure that most interaction between classes happens through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen.
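
A plausible version based on the columns used in the notebook; the exact types (and which fields need `Optional` because of missing values) are assumptions to verify against the data:

```python
from typing import Optional
from pydantic import BaseModel


class Passenger(BaseModel):
    pid: int
    pclass: int
    sex: str
    age: Optional[float] = None   # has missing values, imputed later
    fare: Optional[float] = None
    embarked: str
    family_size: int              # derived: parch + sibsp
    is_alone: int                 # derived from family_size
    title: str                    # parsed from the name, rare titles collapsed
    is_survived: int
```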
78 | ### Step 11: Create domain data object based data loader
79 |
80 | - Create `PassengerLoader` class that takes a "real"/"old" loader
81 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects
82 | - Copy the data transformations from `TitanicModelCreator.run()`
83 |
84 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear fewer than 10 times are selected. This can be done only if you have access to the entire database, and this list needs to be maintained. This can cause problems in a real setting when the above operation is too difficult to do, for example if you have millions of items or a constant stream. These kinds of dependencies are common in legacy code, and one of the goals of refactoring is to identify them and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service.
85 |
86 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other classes; they are a type of adapter that hides away where the data is coming from and how it is stored, and returns only abstract, domain-relevant classes that you can use downstream. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows (a sketch of the factory follows the links):
87 |
88 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to)
89 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
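
A sketch of the factory, with the transformations copied over from `run()` (field names follow the `Passenger` sketch above):

```python
class PassengerLoader:
    def __init__(self, loader, rare_titles=None):
        self.loader = loader                    # the "real"/"old" loader
        self.rare_titles = rare_titles or set()

    def get_passengers(self):
        passengers = []
        for _, row in self.loader.get_passengers().iterrows():
            # parch = Parents/Children, sibsp = Siblings/Spouses
            family_size = row['parch'] + row['sibsp']
            title = row['name'].split(',')[1].split('.')[0].strip()
            passengers.append(Passenger(
                pid=row['pid'],
                pclass=row['pclass'],
                sex=row['sex'],
                age=row['age'],
                fare=row['fare'],
                embarked=row['embarked'],
                family_size=family_size,
                is_alone=1 if family_size == 1 else 0,
                title='rare' if title in self.rare_titles else title,
                is_survived=row['is_survived'],
            ))
        return passengers
```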
90 | ### Step 12: Remove any data that is not explicitly needed
91 |
92 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input
93 |
94 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed. If the requirements change, they can be added back again. For example, the `ticket` column is in `df` but it is never used again in the program. Remove it.
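
Continuing the Step 09 sketch, the narrowed query could be (the column list is an assumption based on what the factory consumes):

```python
query = '''
    SELECT tbl_passengers.pid, pclass, sex, age, parch, sibsp, fare, embarked, name,
           tbl_targets.is_survived
    FROM tbl_passengers
    JOIN tbl_targets ON tbl_passengers.pid = tbl_targets.pid
'''
```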
95 | ### Step 13: Use Passenger objects in the program
96 |
97 | - Add `PassengerLoader` to `main` and `test_main`
98 | - Add the `RARE_TITLES` constant
99 | - Convert the classes back into the `df` dataframe with `passenger.dict()`
100 |
101 | It is very important to do refactoring incrementally. Any change should be small enough that if the tests fail the source can be found quickly. So for now we stop at using the new loader but do not change anything else.
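
A sketch of the updated construction and the temporary conversion back to the dataframe (the `RARE_TITLES` values can be read off the notebook's `rare_titles` output):

```python
RARE_TITLES = {'Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
               'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir', 'the Countess'}


def main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            rare_titles=RARE_TITLES,
        )
    )
    titanic_model_creator.run()
```

Inside `run()`, for now, the passengers are converted straight back: `df = pd.DataFrame([p.dict() for p in self.loader.get_passengers()])`.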
102 | ### Step 14: Separate training and evaluation functions
103 |
104 | - Move all code related to evaluation (variables that have `_test_` in their name) into one group
105 |
106 | After the model is created, first it is trained, then it is evaluated on the training data, then on the testing data. These should be separated from each other into their own logical places. This prepares for moving them into an actually separate place.
107 | ### Step 15: Create `TitanicModel` class
108 |
109 | - Create a class that has all the `sklearn` components as member variables
110 | - Instantiate these before the "Training" block
111 | - Use these instead of the local ones
112 |
113 | The goal of the whole program is to create a model; despite this, until now there was no single object describing this model. The next step is to establish the concept of this model and what kind of services it provides for `TitanicModelCreator`.
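
A minimal sketch, with the parameter values copied from the current `run()`:

```python
class TitanicModel:
    def __init__(self):
        # All sklearn components of the model now live in one place
        self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        self.knn_imputer = KNNImputer(n_neighbors=5)
        self.robust_scaler = RobustScaler()
        self.predictor = LogisticRegression(random_state=0)
```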
114 | ### Step 16: Passenger class based training and evaluation sets
115 |
116 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely if the passenger survived or not)
117 | - Refactor `X_train/X_test` to be created from these lists of passengers
118 |
119 | Because `train_test_split` works on lists, we extract the pids and the targets from the classes and create the two sets from a mapping from pids to passenger classes.
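
One way to sketch that function (assuming the `Passenger` fields from Step 10):

```python
class TitanicModelCreator:
    # ... __init__ and run() as before ...

    def get_train_test_passengers(self, passengers):
        pids = [passenger.pid for passenger in passengers]
        targets = [passenger.is_survived for passenger in passengers]
        pid_to_passenger = {passenger.pid: passenger for passenger in passengers}
        train_pids, test_pids, _, _ = train_test_split(
            pids, targets, stratify=targets, test_size=0.2
        )
        train_passengers = [pid_to_passenger[pid] for pid in train_pids]
        test_passengers = [pid_to_passenger[pid] for pid in test_pids]
        return train_passengers, test_passengers
```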
120 | ### Step 17: Create input processing for `TitanicModel`
121 |
122 | - Move code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`.
123 | - Introduce `self.trained` boolean
124 | - Based on `self.trained` either call the `transform` or `fit_transform` of the `sklearn` input processor functions
125 |
126 | All the input transformation code happens twice: once for the training data and once for the evaluation data, even though transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies the functionality from `TitanicModel`. There will be several steps to resolve this. The resulting code will create a self-contained model that can be shipped independently of its creator.
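
A sketch of the shape of `process_inputs` (the dataframe reconstruction mirrors the current code; treat the details as assumptions):

```python
class TitanicModel:
    # ... __init__ from Step 15, plus self.trained = False ...

    def process_inputs(self, passengers):
        df = pd.DataFrame([passenger.dict() for passenger in passengers])
        categorical = df[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
        numerical = df[['age', 'fare', 'family_size']]
        if self.trained:
            # evaluation: reuse the already-fitted components
            categorical_one_hot = self.one_hot_encoder.transform(categorical)
            numerical_scaled = self.robust_scaler.transform(
                self.knn_imputer.transform(numerical)
            )
        else:
            # training: fit and transform in one go
            categorical_one_hot = self.one_hot_encoder.fit_transform(categorical)
            numerical_scaled = self.robust_scaler.fit_transform(
                self.knn_imputer.fit_transform(numerical)
            )
        return np.hstack((categorical_one_hot, numerical_scaled))
```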
127 |
128 | ### Step 18: Move training into `TitanicModel`
129 |
130 | - Use the same interface as `process_inputs` with `train()`
131 | - Process the data with `process_inputs` (just pass through the arguments)
132 | - Recreate the required targets with the mapping
133 | - Train the model and set the `trained` boolean to `True` (a sketch follows this list)
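
A sketch following these steps (the targets are recreated directly from the passengers here):

```python
class TitanicModel:
    # ... components and process_inputs as before ...

    def train(self, passengers):
        X = self.process_inputs(passengers)  # fits the processors while trained is False
        y = [passenger.is_survived for passenger in passengers]
        self.predictor.fit(X, y)
        self.trained = True
```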
134 | ### Step 19: Move prediction to `TitanicModel`
135 |
136 | - Create the `estimate` function
137 | - Call `process_inputs` and `predictor.predict` in it
138 | - Remove all evaluation input processing code
139 | - Call `estimate` from `run`
140 |
141 | Because there was no separation of concerns, the input processing code was duplicated; now that we moved it to its own location, the duplicate can be removed.
142 |
143 | `X_train_processed` and `X_test_processed` do not exist anymore, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and find a different way to test behaviour. To keep the project short we set this aside, but this would be a good place to introduce more tests.
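
The corresponding sketch:

```python
class TitanicModel:
    # ... train() and process_inputs() as before ...

    def estimate(self, passengers):
        X = self.process_inputs(passengers)  # transform only, since trained is True
        return self.predictor.predict(X)
```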
144 | ### Step 20: Save model and move tests to custom model savers
145 |
146 | - Create `ModelSaver` that has a `save_model` interface that accepts a model and a result object
147 | - Pickle the model and the result to a file
148 | - Create `TestModelSaver` that has the same interface
149 | - Move the testing code to the `save_model` function
150 | - Add `model_saver` property to `TitanicModelCreator` and call it after the evaluation code
151 | - Add an instance of `ModelSaver` and `TestModelSaver` respectively in `main` and `test_main` to the construction of `TitanicModelCreator`
152 |
153 | Currently `TitanicModelCreator` contains its own testing, while it is intended to run in production. It also has no way to save the model. We will introduce the concept of `ModelSaver` here; anything that needs to be preserved after the model training needs to be passed to this class.
154 |
155 | We will also move testing into a specific `TestModelSaver` that, instead of saving the model, will run the tests that would otherwise be in `run()`. This way the same code can run in production and in testing without change.
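
A sketch of the two savers; the shape of the `result` object (a dict here) and the filenames are assumptions:

```python
import pickle


class ModelSaver:
    def __init__(self, model_filename):
        self.model_filename = model_filename

    def save_model(self, model, result):
        with open(self.model_filename, 'wb') as f:
            pickle.dump({'model': model, 'result': result}, f)


class TestModelSaver:
    def save_model(self, model, result):
        # Instead of persisting anything, run the reference-value tests here
        do_test('../data/cm_test.pkl', result['cm_test'])
        do_test('../data/cm_train.pkl', result['cm_train'])
```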
156 | ### Step 21: Enable training of different models
157 |
158 | - Add a `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance.
159 | - Add `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main`
160 | - Expose parts of `TitanicModel` (predictor, processing parameter)
161 |
162 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment that uses new Loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows you to restructure your code on the fly and find out what the most convenient architecture for your system is. Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start the implementation. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintaining code quality provide exactly this at very low effort. A sketch of the resulting construction follows the next-steps list below.
163 |
164 | Next steps:
165 |
166 | - Use different data:
167 | - Update `SqlLoader` to retrieve different data
168 | - Update `Passenger` class to contain this new data
169 | - Update `PassengerLoader` class to process this new data into the classes
170 | - Update `process_inputs` to create features out of this new data
171 | - Use different features
172 | - Update `process_inputs` in `TitanicModel`, expose parameters as needed
173 | - Use different model:
174 | - Use different `predictor` in `TitanicModel`
175 |
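A sketch of the final construction in `main` (with Step 20's `model_saver`; the model filename is an assumption):

```python
def main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            rare_titles=RARE_TITLES,
        ),
        model=TitanicModel(),  # swap in a different model class to experiment
        model_saver=ModelSaver(model_filename='../data/titanic_model.pkl'),
    )
    titanic_model_creator.run()
```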
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CQ4DS Notebook Sklearn Refactoring Exercise
2 |
3 | This step-by-step programme demonstrates how to refactor a Data Science project from notebooks to well-formed classes and scripts.
4 |
5 | ### The project:
6 |
7 | The notebook demonstrates a typical setup of a data science project:
8 |
9 | - Connects to a database (included in the repository as an SQLite file).
10 | - Gathers some data (the classic Titanic example).
11 | - Does feature engineering.
12 | - Fits a model to estimate survival (sklearn's LogisticRegression).
13 | - Evaluates the model.
14 |
15 | ### Context, vision
16 |
17 | I wrote a detailed post on the concepts, strategy and big picture thinking. I recommend reading it in parallel with the instructions and the steps in the pull request while you are doing the exercises:
18 |
19 | [https://laszlo.substack.com/p/refactoring-the-titanic](https://laszlo.substack.com/p/refactoring-the-titanic)
20 |
21 | ### Refactoring
22 |
23 | The programme demonstrates how to improve code quality, increase agility and prepare for unforeseen changes in a real-world project (see `INSTRUCTIONS.md` for reference reading). You will perform the following steps:
24 |
25 | - Create end-to-end functional testing
26 | - Create shell scripts, command line interfaces, virtual environments
27 | - Decouple from external sources (the Database)
28 | - Refactor with simple Design Patterns (Adapter/Factory/Strategy)
29 | - Improve readability
30 | - Reduce code duplication
31 |
32 | ### Howto:
33 |
34 | - Clone the repository.
35 | - Create a virtual environment with `make_venv.sh`.
36 | - Follow the instructions in `INSTRUCTIONS.md`.
37 | - Run the tests with `titanic_model.sh`.
38 | - Check the diffs of the pull request's steps to verify your progress.
39 |
40 | ### Community:
41 |
42 | For more information and help, join our interactive self-help Code Quality for Data Science (CQ4DS) community on discord: [https://discord.gg/8uUZNMCad2](https://discord.gg/8uUZNMCad2).
43 |
44 | Original project content from and inspired by: [https://jaketae.github.io/study/sklearn-pipeline/](https://jaketae.github.io/study/sklearn-pipeline/)
45 |
--------------------------------------------------------------------------------
/Step00/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step00/README.md
--------------------------------------------------------------------------------
/Step00/Slide0_Notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import os\n",
10 | "import pickle\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "from collections import Counter\n",
14 | "from sqlalchemy import create_engine\n",
15 | "\n",
16 | "from sklearn.model_selection import train_test_split\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "from sklearn.preprocessing import RobustScaler\n",
19 | "from sklearn.preprocessing import OneHotEncoder\n",
20 | "from sklearn.impute import KNNImputer\n",
21 | "from sklearn.metrics import confusion_matrix"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 2,
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "data": {
31 | "text/html": [
32 | "
\n",
33 | "\n",
46 | "
\n",
47 | " \n",
48 | " \n",
49 | " | \n",
50 | " type | \n",
51 | " name | \n",
52 | " tbl_name | \n",
53 | " rootpage | \n",
54 | " sql | \n",
55 | "
\n",
56 | " \n",
57 | " \n",
58 | " \n",
59 | " 0 | \n",
60 | " table | \n",
61 | " tbl_passengers | \n",
62 | " tbl_passengers | \n",
63 | " 2 | \n",
64 | " CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\... | \n",
65 | "
\n",
66 | " \n",
67 | " 1 | \n",
68 | " table | \n",
69 | " tbl_targets | \n",
70 | " tbl_targets | \n",
71 | " 35 | \n",
72 | " CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t... | \n",
73 | "
\n",
74 | " \n",
75 | "
\n",
76 | "
"
77 | ],
78 | "text/plain": [
79 | " type name tbl_name rootpage \\\n",
80 | "0 table tbl_passengers tbl_passengers 2 \n",
81 | "1 table tbl_targets tbl_targets 35 \n",
82 | "\n",
83 | " sql \n",
84 | "0 CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\... \n",
85 | "1 CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t... "
86 | ]
87 | },
88 | "execution_count": 2,
89 | "metadata": {},
90 | "output_type": "execute_result"
91 | }
92 | ],
93 | "source": [
94 | "engine = create_engine('sqlite:///../data/titanic.db')\n",
95 | "sqlite_connection = engine.connect()\n",
96 | "pd.read_sql('SELECT * FROM sqlite_schema WHERE type=\"table\"', con=sqlite_connection)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 3,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "np.random.seed(42)\n",
106 | "\n",
107 | "df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)\n",
108 | "\n",
109 | "targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)\n",
110 | "\n",
111 | "# df, targets = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n",
112 | "\n",
113 | "# parch = Parents/Children, sibsp = Siblings/Spouses\n",
114 | "df['family_size'] = df['parch'] + df['sibsp']\n",
115 | "df['is_alone'] = [1 if family_size==1 else 0 for family_size in df['family_size']]\n",
116 | "\n",
117 | "df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]\n",
118 | "rare_titles = {k for k,v in Counter(df['title']).items() if v < 10}\n",
119 | "df['title'] = ['rare' if title in rare_titles else title for title in df['title']]\n",
120 | "\n",
121 | "df = df[[\n",
122 | " 'pclass', 'sex', 'age', 'ticket', 'family_size',\n",
123 | " 'fare', 'embarked', 'is_alone', 'title'\n",
124 | "]]\n",
125 | "\n",
126 | "targets = [int(v) for v in targets['is_survived']]\n",
127 | "X_train, X_test, y_train, y_test = train_test_split(df, targets, stratify=targets, test_size=0.2)"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 4,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "data": {
137 | "text/html": [
138 | "\n",
139 | "\n",
152 | "
\n",
153 | " \n",
154 | " \n",
155 | " | \n",
156 | " pclass | \n",
157 | " sex | \n",
158 | " age | \n",
159 | " ticket | \n",
160 | " family_size | \n",
161 | " fare | \n",
162 | " embarked | \n",
163 | " is_alone | \n",
164 | " title | \n",
165 | "
\n",
166 | " \n",
167 | " \n",
168 | " \n",
169 | " 0 | \n",
170 | " 1.0 | \n",
171 | " female | \n",
172 | " 29.0000 | \n",
173 | " 24160 | \n",
174 | " 0.0 | \n",
175 | " 211.3375 | \n",
176 | " S | \n",
177 | " 0 | \n",
178 | " Miss | \n",
179 | "
\n",
180 | " \n",
181 | " 1 | \n",
182 | " 1.0 | \n",
183 | " male | \n",
184 | " 0.9167 | \n",
185 | " 113781 | \n",
186 | " 3.0 | \n",
187 | " 151.5500 | \n",
188 | " S | \n",
189 | " 0 | \n",
190 | " Master | \n",
191 | "
\n",
192 | " \n",
193 | " 2 | \n",
194 | " 1.0 | \n",
195 | " female | \n",
196 | " 2.0000 | \n",
197 | " 113781 | \n",
198 | " 3.0 | \n",
199 | " 151.5500 | \n",
200 | " S | \n",
201 | " 0 | \n",
202 | " Miss | \n",
203 | "
\n",
204 | " \n",
205 | "
\n",
206 | "
"
207 | ],
208 | "text/plain": [
209 | " pclass sex age ticket family_size fare embarked is_alone \\\n",
210 | "0 1.0 female 29.0000 24160 0.0 211.3375 S 0 \n",
211 | "1 1.0 male 0.9167 113781 3.0 151.5500 S 0 \n",
212 | "2 1.0 female 2.0000 113781 3.0 151.5500 S 0 \n",
213 | "\n",
214 | " title \n",
215 | "0 Miss \n",
216 | "1 Master \n",
217 | "2 Miss "
218 | ]
219 | },
220 | "execution_count": 4,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "df[:3]"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 5,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "X_train_categorical = X_train[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n",
236 | "X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n",
237 | "\n",
238 | "one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(X_train_categorical)\n",
239 | "X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)\n",
240 | "X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 6,
246 | "metadata": {},
247 | "outputs": [],
248 | "source": [
249 | "X_train_numerical = X_train[['age', 'fare', 'family_size']]\n",
250 | "X_test_numerical = X_test[['age', 'fare', 'family_size']]\n",
251 | "knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)\n",
252 | "X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)\n",
253 | "X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 7,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "robust_scaler = RobustScaler().fit(X_train_numerical_imputed)\n",
263 | "X_train_numerical_imputed_scaled = robust_scaler.transform(X_train_numerical_imputed)\n",
264 | "X_test_numerical_imputed_scaled = robust_scaler.transform(X_test_numerical_imputed)"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 8,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "X_train_processed = np.hstack((X_train_categorical_one_hot, X_train_numerical_imputed_scaled))\n",
274 | "X_test_processed = np.hstack((X_test_categorical_one_hot, X_test_numerical_imputed_scaled))"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 9,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)\n",
284 | "y_train_estimation = model.predict(X_train_processed)\n",
285 | "y_test_estimation = model.predict(X_test_processed)"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 10,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "cm_train = confusion_matrix(y_train, y_train_estimation)"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 11,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "cm_test = confusion_matrix(y_test, y_test_estimation)"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 12,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "array([[553, 94],\n",
315 | " [107, 293]])"
316 | ]
317 | },
318 | "execution_count": 12,
319 | "metadata": {},
320 | "output_type": "execute_result"
321 | }
322 | ],
323 | "source": [
324 | "cm_train"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 13,
330 | "metadata": {},
331 | "outputs": [
332 | {
333 | "data": {
334 | "text/plain": [
335 | "array([[142, 20],\n",
336 | " [ 22, 78]])"
337 | ]
338 | },
339 | "execution_count": 13,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "cm_test"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 14,
351 | "metadata": {},
352 | "outputs": [
353 | {
354 | "name": "stdout",
355 | "output_type": "stream",
356 | "text": [
357 | "../data/cm_test.pkl test passed\n",
358 | "../data/cm_train.pkl test passed\n",
359 | "../data/X_train_processed.pkl test passed\n",
360 | "../data/X_test_processed.pkl test passed\n"
361 | ]
362 | }
363 | ],
364 | "source": [
365 | "def do_test(filename, data):\n",
366 | " if not os.path.isfile(filename):\n",
367 | " pickle.dump(data, open(filename, 'wb'))\n",
368 | " truth = pickle.load(open(filename, 'rb'))\n",
369 | " try:\n",
370 | " np.testing.assert_almost_equal(data, truth)\n",
371 | " print(f'{filename} test passed')\n",
372 | " except AssertionError as ex:\n",
373 | " print(f'{filename} test failed {ex}')\n",
374 | " \n",
375 | "do_test('../data/cm_test.pkl', cm_test)\n",
376 | "do_test('../data/cm_train.pkl', cm_train)\n",
377 | "do_test('../data/X_train_processed.pkl', X_train_processed)\n",
378 | "do_test('../data/X_test_processed.pkl', X_test_processed)\n"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 15,
384 | "metadata": {},
385 | "outputs": [
386 | {
387 | "name": "stdout",
388 | "output_type": "stream",
389 | "text": [
390 | "../data/df.pkl pandas test passed\n"
391 | ]
392 | }
393 | ],
394 | "source": [
395 | "def do_pandas_test(filename, data):\n",
396 | " if not os.path.isfile(filename):\n",
397 | " data.to_pickle(filename)\n",
398 | " truth = pd.read_pickle(filename)\n",
399 | " try:\n",
400 | " pd.testing.assert_frame_equal(data, truth)\n",
401 | " print(f'{filename} pandas test passed')\n",
402 | " except AssertionError as ex:\n",
403 | " print(f'{filename} pandas test failed {ex}')\n",
404 | " \n",
405 | "# df['title'] = ['asd' for v in df['title']]\n",
406 | "do_pandas_test('../data/df.pkl', df)"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 16,
412 | "metadata": {},
413 | "outputs": [
414 | {
415 | "data": {
416 | "text/plain": [
417 | "{'Capt',\n",
418 | " 'Col',\n",
419 | " 'Don',\n",
420 | " 'Dona',\n",
421 | " 'Dr',\n",
422 | " 'Jonkheer',\n",
423 | " 'Lady',\n",
424 | " 'Major',\n",
425 | " 'Mlle',\n",
426 | " 'Mme',\n",
427 | " 'Ms',\n",
428 | " 'Rev',\n",
429 | " 'Sir',\n",
430 | " 'the Countess'}"
431 | ]
432 | },
433 | "execution_count": 16,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "rare_titles"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": []
448 | }
449 | ],
450 | "metadata": {
451 | "kernelspec": {
452 | "display_name": "Python 3 (ipykernel)",
453 | "language": "python",
454 | "name": "python3"
455 | },
456 | "language_info": {
457 | "codemirror_mode": {
458 | "name": "ipython",
459 | "version": 3
460 | },
461 | "file_extension": ".py",
462 | "mimetype": "text/x-python",
463 | "name": "python",
464 | "nbconvert_exporter": "python",
465 | "pygments_lexer": "ipython3",
466 | "version": "3.8.10"
467 | },
468 | "vscode": {
469 | "interpreter": {
470 | "hash": "774712da715a3086605d6bf08e7144a3a7e717b0d5585da12e288357dd4c8f07"
471 | }
472 | }
473 | },
474 | "nbformat": 4,
475 | "nbformat_minor": 4
476 | }
477 |
--------------------------------------------------------------------------------
/Step01/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/.gitkeep
--------------------------------------------------------------------------------
/Step01/README00.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/README00.md
--------------------------------------------------------------------------------
/Step02/README01.md:
--------------------------------------------------------------------------------
1 | ### Step 01: Project setup
2 |
3 | - Write script to create virtual environment
4 | - Write the first `requirements.txt`
5 |
6 | You can select a different `setuptools` version or pin the package versions.
7 |
--------------------------------------------------------------------------------
/Step02/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step02/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step03/README02.md:
--------------------------------------------------------------------------------
1 | ### Step 02: Code setup
2 |
3 | - Write python script stub with `typer`
4 | - Write shell script to execute the python script
5 |
6 | `Typer` is an amazing tool that turns any python function into a command line interface. Here we use it for future-proofing because at the moment there are no CLI arguments.
7 |
8 | The program will be defined in a class that is instantiated by the `main()` function, which then calls its main `run()` entry point. The `main()` function will be called by `typer` to pass in any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`). More on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london).
9 |
--------------------------------------------------------------------------------
/Step03/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step03/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step03/titanic_model.py:
--------------------------------------------------------------------------------
1 | import typer
2 |
3 |
4 | class TitanicModelCreator:
5 | def __init__(self):
6 | pass
7 |
8 | def run(self):
9 | print('Hello World!')
10 |
11 |
12 | def main(param: str = 'pass'):
13 | titanic_model_creator = TitanicModelCreator()
14 | titanic_model_creator.run()
15 |
16 |
17 | if __name__ == "__main__":
18 | typer.run(main)
19 |
--------------------------------------------------------------------------------
/Step03/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step04/README03.md:
--------------------------------------------------------------------------------
1 | ### Step 03: Move code out of the notebook
2 |
3 | - Copy-paste everything into the `run()` function
4 |
5 | First step is to get started. There will be plenty of steps to structure the code better.
6 |
--------------------------------------------------------------------------------
/Step04/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step04/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step04/titanic_model.py:
--------------------------------------------------------------------------------
1 | import typer
2 | import numpy as np
3 | import pandas as pd
4 | from collections import Counter
5 | from sqlalchemy import create_engine
6 |
7 | from sklearn.model_selection import train_test_split
8 | from sklearn.linear_model import LogisticRegression
9 | from sklearn.preprocessing import RobustScaler
10 | from sklearn.preprocessing import OneHotEncoder
11 | from sklearn.impute import KNNImputer
12 | from sklearn.metrics import confusion_matrix
13 |
14 |
15 | class TitanicModelCreator:
16 | def __init__(self):
17 | pass
18 |
19 | def run(self):
20 | engine = create_engine('sqlite:///../data/titanic.db')
21 | sqlite_connection = engine.connect()
22 | pd.read_sql(
23 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection
24 | )
25 | np.random.seed(42)
26 |
27 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)
28 |
29 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)
30 |
31 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
32 |
33 | # parch = Parents/Children, sibsp = Siblings/Spouses
34 | df['family_size'] = df['parch'] + df['sibsp']
35 | df['is_alone'] = [
36 | 1 if family_size == 1 else 0 for family_size in df['family_size']
37 | ]
38 |
39 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
40 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
41 | df['title'] = [
42 | 'rare' if title in rare_titles else title for title in df['title']
43 | ]
44 |
45 | df = df[
46 | [
47 | 'pclass',
48 | 'sex',
49 | 'age',
50 | 'ticket',
51 | 'family_size',
52 | 'fare',
53 | 'embarked',
54 | 'is_alone',
55 | 'title',
56 | ]
57 | ]
58 |
59 | targets = [int(v) for v in targets['is_survived']]
60 | X_train, X_test, y_train, y_test = train_test_split(
61 | df, targets, stratify=targets, test_size=0.2
62 | )
63 |
64 | X_train_categorical = X_train[
65 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
66 | ]
67 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
68 |
69 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
70 | X_train_categorical
71 | )
72 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
73 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
74 |
75 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
76 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
77 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
78 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
79 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
80 |
81 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
82 | X_train_numerical_imputed_scaled = robust_scaler.transform(
83 | X_train_numerical_imputed
84 | )
85 | X_test_numerical_imputed_scaled = robust_scaler.transform(
86 | X_test_numerical_imputed
87 | )
88 |
89 | X_train_processed = np.hstack(
90 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
91 | )
92 | X_test_processed = np.hstack(
93 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
94 | )
95 |
96 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
97 | y_train_estimation = model.predict(X_train_processed)
98 | y_test_estimation = model.predict(X_test_processed)
99 |
100 | cm_train = confusion_matrix(y_train, y_train_estimation)
101 |
102 | cm_test = confusion_matrix(y_test, y_test_estimation)
103 |
104 | print('cm_train', cm_train)
105 | print('cm_test', cm_test)
106 |
107 |
108 | def main(param: str = 'pass'):
109 | titanic_model_creator = TitanicModelCreator()
110 | titanic_model_creator.run()
111 |
112 |
113 | if __name__ == "__main__":
114 | typer.run(main)
115 |
--------------------------------------------------------------------------------
/Step04/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step05/README04.md:
--------------------------------------------------------------------------------
1 | ### Step 04: Move over the tests
2 |
3 | - Copy-paste tests and testing code from the notebook in `Step00` into the `run()` function.
4 |
5 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the value of some variables, and the next time you run the code it will compare them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
6 |
--------------------------------------------------------------------------------
/Step05/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step05/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step05/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class TitanicModelCreator:
40 | def __init__(self):
41 | pass
42 |
43 | def run(self):
44 | engine = create_engine('sqlite:///../data/titanic.db')
45 | sqlite_connection = engine.connect()
46 | pd.read_sql(
47 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection
48 | )
49 | np.random.seed(42)
50 |
51 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)
52 |
53 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)
54 |
55 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
56 |
57 | # parch = Parents/Children, sibsp = Siblings/Spouses
58 | df['family_size'] = df['parch'] + df['sibsp']
59 | df['is_alone'] = [
60 | 1 if family_size == 1 else 0 for family_size in df['family_size']
61 | ]
62 |
63 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
64 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
65 | df['title'] = [
66 | 'rare' if title in rare_titles else title for title in df['title']
67 | ]
68 |
69 | df = df[
70 | [
71 | 'pclass',
72 | 'sex',
73 | 'age',
74 | 'ticket',
75 | 'family_size',
76 | 'fare',
77 | 'embarked',
78 | 'is_alone',
79 | 'title',
80 | ]
81 | ]
82 |
83 | targets = [int(v) for v in targets['is_survived']]
84 | X_train, X_test, y_train, y_test = train_test_split(
85 | df, targets, stratify=targets, test_size=0.2
86 | )
87 |
88 | X_train_categorical = X_train[
89 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
90 | ]
91 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
92 |
93 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
94 | X_train_categorical
95 | )
96 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
97 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
98 |
99 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
100 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
101 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
102 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
103 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
104 |
105 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
106 | X_train_numerical_imputed_scaled = robust_scaler.transform(
107 | X_train_numerical_imputed
108 | )
109 | X_test_numerical_imputed_scaled = robust_scaler.transform(
110 | X_test_numerical_imputed
111 | )
112 |
113 | X_train_processed = np.hstack(
114 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
115 | )
116 | X_test_processed = np.hstack(
117 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
118 | )
119 |
120 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
121 | y_train_estimation = model.predict(X_train_processed)
122 | y_test_estimation = model.predict(X_test_processed)
123 |
124 | cm_train = confusion_matrix(y_train, y_train_estimation)
125 |
126 | cm_test = confusion_matrix(y_test, y_test_estimation)
127 |
128 | print('cm_train', cm_train)
129 | print('cm_test', cm_test)
130 |
131 | do_test('../data/cm_test.pkl', cm_test)
132 | do_test('../data/cm_train.pkl', cm_train)
133 | do_test('../data/X_train_processed.pkl', X_train_processed)
134 | do_test('../data/X_test_processed.pkl', X_test_processed)
135 |
136 | do_pandas_test('../data/df.pkl', df)
137 |
138 |
139 | def main(param: str = 'pass'):
140 | titanic_model_creator = TitanicModelCreator()
141 | titanic_model_creator.run()
142 |
143 |
144 | if __name__ == "__main__":
145 | typer.run(main)
146 |
--------------------------------------------------------------------------------
/Step05/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step06/README05.md:
--------------------------------------------------------------------------------
1 | ### Step 05: Decouple from the database
2 |
3 | - Write `SQLLoader` class
4 | - Move database related code into it
5 | - Replace database calls with interface calls in `run()`
6 |
7 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
8 |
--------------------------------------------------------------------------------
/Step06/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step06/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step06/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TitanicModelCreator:
54 | def __init__(self):
55 | np.random.seed(42)
56 |
57 | def run(self):
58 | loader = SqlLoader(connection_string='sqlite:///../data/titanic.db')
59 |
60 | df = loader.get_passengers()
61 | targets = loader.get_targets()
62 |
63 | # parch = Parents/Children, sibsp = Siblings/Spouses
64 | df['family_size'] = df['parch'] + df['sibsp']
65 | df['is_alone'] = [
66 | 1 if family_size == 1 else 0 for family_size in df['family_size']
67 | ]
68 |
69 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
70 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
71 | df['title'] = [
72 | 'rare' if title in rare_titles else title for title in df['title']
73 | ]
74 |
75 | df = df[
76 | [
77 | 'pclass',
78 | 'sex',
79 | 'age',
80 | 'ticket',
81 | 'family_size',
82 | 'fare',
83 | 'embarked',
84 | 'is_alone',
85 | 'title',
86 | ]
87 | ]
88 |
89 | targets = [int(v) for v in targets['is_survived']]
90 | X_train, X_test, y_train, y_test = train_test_split(
91 | df, targets, stratify=targets, test_size=0.2
92 | )
93 |
94 | X_train_categorical = X_train[
95 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
96 | ]
97 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
98 |
99 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
100 | X_train_categorical
101 | )
102 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
103 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
104 |
105 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
106 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
107 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
108 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
109 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
110 |
111 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
112 | X_train_numerical_imputed_scaled = robust_scaler.transform(
113 | X_train_numerical_imputed
114 | )
115 | X_test_numerical_imputed_scaled = robust_scaler.transform(
116 | X_test_numerical_imputed
117 | )
118 |
119 | X_train_processed = np.hstack(
120 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
121 | )
122 | X_test_processed = np.hstack(
123 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
124 | )
125 |
126 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
127 | y_train_estimation = model.predict(X_train_processed)
128 | y_test_estimation = model.predict(X_test_processed)
129 |
130 | cm_train = confusion_matrix(y_train, y_train_estimation)
131 |
132 | cm_test = confusion_matrix(y_test, y_test_estimation)
133 |
134 | print('cm_train', cm_train)
135 | print('cm_test', cm_test)
136 |
137 | do_test('../data/cm_test.pkl', cm_test)
138 | do_test('../data/cm_train.pkl', cm_train)
139 | do_test('../data/X_train_processed.pkl', X_train_processed)
140 | do_test('../data/X_test_processed.pkl', X_test_processed)
141 |
142 | do_pandas_test('../data/df.pkl', df)
143 |
144 |
145 | def main(param: str = 'pass'):
146 | titanic_model_creator = TitanicModelCreator()
147 | titanic_model_creator.run()
148 |
149 |
150 | if __name__ == "__main__":
151 | typer.run(main)
152 |
--------------------------------------------------------------------------------
/Step06/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step07/README06.md:
--------------------------------------------------------------------------------
1 | ### Step 06: Decouple from the database
2 |
3 | - Create a loader property and argument in `TitanicModelCreator.__init__()`
4 | - Remove the database loader instantiation from the `run()` function
5 | - Update `TitanicModelCreator` construction to create the loader there
6 |
7 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing the ground for a test context that supports rapid iteration. Once you have created the adapter class, this change does the decoupling. It is an example of "Dependency Injection": a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction. As the saying goes: "Complex behaviour is constructed, not written." The closely related Dependency Inversion Principle is the `D` in the famed `SOLID` principles, and arguably the most important.
8 |
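9 | A minimal sketch of the injection idea. The `CsvLoader` below (and its filenames) is hypothetical, not part of this repo; it stands in for any alternative data source with the same interface:
10 | 
11 | ```python
12 | import pandas as pd
13 | 
14 | from titanic_model import TitanicModelCreator, SqlLoader
15 | 
16 | 
17 | class CsvLoader:
18 |     """Hypothetical loader with the same interface as SqlLoader."""
19 |     def __init__(self, passengers_filename, targets_filename):
20 |         self.passengers_filename = passengers_filename
21 |         self.targets_filename = targets_filename
22 | 
23 |     def get_passengers(self):
24 |         return pd.read_csv(self.passengers_filename)
25 | 
26 |     def get_targets(self):
27 |         return pd.read_csv(self.targets_filename)
28 | 
29 | 
30 | # Same class, two behaviours, chosen purely at construction time:
31 | TitanicModelCreator(loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')).run()
32 | TitanicModelCreator(loader=CsvLoader('../data/passengers.csv', '../data/targets.csv')).run()
33 | ```
34 | 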
--------------------------------------------------------------------------------
/Step07/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step07/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step07/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TitanicModelCreator:
54 | def __init__(self, loader):
55 | self.loader = loader
56 | np.random.seed(42)
57 |
58 | def run(self):
59 | df = self.loader.get_passengers()
60 | targets = self.loader.get_targets()
61 |
62 | # parch = Parents/Children, sibsp = Siblings/Spouses
63 | df['family_size'] = df['parch'] + df['sibsp']
64 | df['is_alone'] = [
65 | 1 if family_size == 1 else 0 for family_size in df['family_size']
66 | ]
67 |
68 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
69 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
70 | df['title'] = [
71 | 'rare' if title in rare_titles else title for title in df['title']
72 | ]
73 |
74 | df = df[
75 | [
76 | 'pclass',
77 | 'sex',
78 | 'age',
79 | 'ticket',
80 | 'family_size',
81 | 'fare',
82 | 'embarked',
83 | 'is_alone',
84 | 'title',
85 | ]
86 | ]
87 |
88 | targets = [int(v) for v in targets['is_survived']]
89 | X_train, X_test, y_train, y_test = train_test_split(
90 | df, targets, stratify=targets, test_size=0.2
91 | )
92 |
93 | X_train_categorical = X_train[
94 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
95 | ]
96 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
97 |
98 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
99 | X_train_categorical
100 | )
101 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
102 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
103 |
104 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
105 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
106 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
107 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
108 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
109 |
110 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
111 | X_train_numerical_imputed_scaled = robust_scaler.transform(
112 | X_train_numerical_imputed
113 | )
114 | X_test_numerical_imputed_scaled = robust_scaler.transform(
115 | X_test_numerical_imputed
116 | )
117 |
118 | X_train_processed = np.hstack(
119 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
120 | )
121 | X_test_processed = np.hstack(
122 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
123 | )
124 |
125 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
126 | y_train_estimation = model.predict(X_train_processed)
127 | y_test_estimation = model.predict(X_test_processed)
128 |
129 | cm_train = confusion_matrix(y_train, y_train_estimation)
130 |
131 | cm_test = confusion_matrix(y_test, y_test_estimation)
132 |
133 | print('cm_train', cm_train)
134 | print('cm_test', cm_test)
135 |
136 | do_test('../data/cm_test.pkl', cm_test)
137 | do_test('../data/cm_train.pkl', cm_train)
138 | do_test('../data/X_train_processed.pkl', X_train_processed)
139 | do_test('../data/X_test_processed.pkl', X_test_processed)
140 |
141 | do_pandas_test('../data/df.pkl', df)
142 |
143 |
144 | def main(param: str = 'pass'):
145 | titanic_model_creator = TitanicModelCreator(
146 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
147 | )
148 | titanic_model_creator.run()
149 |
150 |
151 | if __name__ == "__main__":
152 | typer.run(main)
153 |
--------------------------------------------------------------------------------
/Step07/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step08/README07.md:
--------------------------------------------------------------------------------
1 | ### Step 07: Write testing dataloader
2 |
3 | - Write a class that loads the required data from files
4 | - Same interface as `SqlLoader`
5 | - Add a "real" loader to it as a property
6 |
7 | This will allow the test context to work without a DB connection while still having the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable because they have the same interface.
8 |
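9 | Python's duck typing keeps "same interface" implicit. If you want to state it explicitly, a `typing.Protocol` can document what `TitanicModelCreator` expects of any loader; the `PassengerSource` name below is ours, not part of the repo:
10 | 
11 | ```python
12 | from typing import Protocol
13 | 
14 | import pandas as pd
15 | 
16 | from titanic_model import TitanicModelCreator
17 | 
18 | 
19 | class PassengerSource(Protocol):
20 |     """The methods TitanicModelCreator actually calls on its loader."""
21 |     def get_passengers(self) -> pd.DataFrame: ...
22 |     def get_targets(self) -> pd.DataFrame: ...
23 | 
24 | 
25 | # Both SqlLoader and TestLoader satisfy this structurally: no common
26 | # base class or inheritance is needed, only matching method signatures.
27 | def make_creator(loader: PassengerSource) -> TitanicModelCreator:
28 |     return TitanicModelCreator(loader=loader)
29 | ```
30 | 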
--------------------------------------------------------------------------------
/Step08/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step08/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step08/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TestLoader:
54 | def __init__(self, passengers_filename, targets_filename, real_loader):
55 | self.passengers_filename = passengers_filename
56 | self.targets_filename = targets_filename
57 | self.real_loader = real_loader
58 | if not os.path.isfile(self.passengers_filename):
59 | df = self.real_loader.get_passengers()
60 | df.to_pickle(self.passengers_filename)
61 | if not os.path.isfile(self.targets_filename):
62 | df = self.real_loader.get_targets()
63 | df.to_pickle(self.targets_filename)
64 |
65 | def get_passengers(self):
66 | return pd.read_pickle(self.passengers_filename)
67 |
68 | def get_targets(self):
69 | return pd.read_pickle(self.targets_filename)
70 |
71 |
72 | class TitanicModelCreator:
73 | def __init__(self, loader):
74 | self.loader = loader
75 | np.random.seed(42)
76 |
77 | def run(self):
78 | df = self.loader.get_passengers()
79 | targets = self.loader.get_targets()
80 |
81 | # parch = Parents/Children, sibsp = Siblings/Spouses
82 | df['family_size'] = df['parch'] + df['sibsp']
83 | df['is_alone'] = [
84 | 1 if family_size == 1 else 0 for family_size in df['family_size']
85 | ]
86 |
87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
89 | df['title'] = [
90 | 'rare' if title in rare_titles else title for title in df['title']
91 | ]
92 |
93 | df = df[
94 | [
95 | 'pclass',
96 | 'sex',
97 | 'age',
98 | 'ticket',
99 | 'family_size',
100 | 'fare',
101 | 'embarked',
102 | 'is_alone',
103 | 'title',
104 | ]
105 | ]
106 |
107 | targets = [int(v) for v in targets['is_survived']]
108 | X_train, X_test, y_train, y_test = train_test_split(
109 | df, targets, stratify=targets, test_size=0.2
110 | )
111 |
112 | X_train_categorical = X_train[
113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
114 | ]
115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
116 |
117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
118 | X_train_categorical
119 | )
120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
122 |
123 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
124 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
128 |
129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
130 | X_train_numerical_imputed_scaled = robust_scaler.transform(
131 | X_train_numerical_imputed
132 | )
133 | X_test_numerical_imputed_scaled = robust_scaler.transform(
134 | X_test_numerical_imputed
135 | )
136 |
137 | X_train_processed = np.hstack(
138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
139 | )
140 | X_test_processed = np.hstack(
141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
142 | )
143 |
144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
145 | y_train_estimation = model.predict(X_train_processed)
146 | y_test_estimation = model.predict(X_test_processed)
147 |
148 | cm_train = confusion_matrix(y_train, y_train_estimation)
149 |
150 | cm_test = confusion_matrix(y_test, y_test_estimation)
151 |
152 | print('cm_train', cm_train)
153 | print('cm_test', cm_test)
154 |
155 | do_test('../data/cm_test.pkl', cm_test)
156 | do_test('../data/cm_train.pkl', cm_train)
157 | do_test('../data/X_train_processed.pkl', X_train_processed)
158 | do_test('../data/X_test_processed.pkl', X_test_processed)
159 |
160 | do_pandas_test('../data/df.pkl', df)
161 |
162 |
163 | def main(param: str = 'pass'):
164 | titanic_model_creator = TitanicModelCreator(
165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
166 | )
167 | titanic_model_creator.run()
168 |
169 |
170 | if __name__ == "__main__":
171 | typer.run(main)
172 |
--------------------------------------------------------------------------------
/Step08/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step09/README08.md:
--------------------------------------------------------------------------------
1 | ### Step 08: Write the test context
2 |
3 | - Create `test_main` function
4 | - Make sure `typer` calls that in `typer.run()`
5 | - Copy the code from `main` to `test_main`
6 | - Replace `SqlLoader` in it with `TestLoader`
7 |
8 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. If that is still not fast enough, an additional parameter can reduce the amount of data used in the test to make the process faster (a sketch follows below). [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo]
9 |
10 | This is the essence of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places, no time is spent translating from one to the other. The test setup should reflect the production context as closely as possible, so that when a test fails or passes you can expect the same to happen in production. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" once you are convinced it is doing what you think it should do. And because it is the same code, deployment is effortless.
11 |
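12 | A sketch of what that extra parameter could look like. The `max_rows` argument is not in the repo, and the sketch is abbreviated to the passengers side of the loader:
13 | 
14 | ```python
15 | import os
16 | 
17 | import pandas as pd
18 | 
19 | 
20 | class SampledTestLoader:
21 |     """Like TestLoader, plus an optional row cap to speed tests up."""
22 |     def __init__(self, passengers_filename, real_loader, max_rows=None):
23 |         self.passengers_filename = passengers_filename
24 |         self.real_loader = real_loader
25 |         self.max_rows = max_rows
26 |         # Fall back to the real (DB) loader only on the first run
27 |         if not os.path.isfile(self.passengers_filename):
28 |             self.real_loader.get_passengers().to_pickle(self.passengers_filename)
29 | 
30 |     def get_passengers(self):
31 |         df = pd.read_pickle(self.passengers_filename)
32 |         # A smaller frame makes every downstream step faster
33 |         return df if self.max_rows is None else df.head(self.max_rows)
34 | ```
35 | 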
--------------------------------------------------------------------------------
/Step09/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step09/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step09/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TestLoader:
54 | def __init__(self, passengers_filename, targets_filename, real_loader):
55 | self.passengers_filename = passengers_filename
56 | self.targets_filename = targets_filename
57 | self.real_loader = real_loader
58 | if not os.path.isfile(self.passengers_filename):
59 | df = self.real_loader.get_passengers()
60 | df.to_pickle(self.passengers_filename)
61 | if not os.path.isfile(self.targets_filename):
62 | df = self.real_loader.get_targets()
63 | df.to_pickle(self.targets_filename)
64 |
65 | def get_passengers(self):
66 | return pd.read_pickle(self.passengers_filename)
67 |
68 | def get_targets(self):
69 | return pd.read_pickle(self.targets_filename)
70 |
71 |
72 | class TitanicModelCreator:
73 | def __init__(self, loader):
74 | self.loader = loader
75 | np.random.seed(42)
76 |
77 | def run(self):
78 | df = self.loader.get_passengers()
79 | targets = self.loader.get_targets()
80 |
81 | # parch = Parents/Children, sibsp = Siblings/Spouses
82 | df['family_size'] = df['parch'] + df['sibsp']
83 | df['is_alone'] = [
84 | 1 if family_size == 1 else 0 for family_size in df['family_size']
85 | ]
86 |
87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
89 | df['title'] = [
90 | 'rare' if title in rare_titles else title for title in df['title']
91 | ]
92 |
93 | df = df[
94 | [
95 | 'pclass',
96 | 'sex',
97 | 'age',
98 | 'ticket',
99 | 'family_size',
100 | 'fare',
101 | 'embarked',
102 | 'is_alone',
103 | 'title',
104 | ]
105 | ]
106 |
107 | targets = [int(v) for v in targets['is_survived']]
108 | X_train, X_test, y_train, y_test = train_test_split(
109 | df, targets, stratify=targets, test_size=0.2
110 | )
111 |
112 | X_train_categorical = X_train[
113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
114 | ]
115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
116 |
117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
118 | X_train_categorical
119 | )
120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
122 |
123 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
124 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
128 |
129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
130 | X_train_numerical_imputed_scaled = robust_scaler.transform(
131 | X_train_numerical_imputed
132 | )
133 | X_test_numerical_imputed_scaled = robust_scaler.transform(
134 | X_test_numerical_imputed
135 | )
136 |
137 | X_train_processed = np.hstack(
138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
139 | )
140 | X_test_processed = np.hstack(
141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
142 | )
143 |
144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
145 | y_train_estimation = model.predict(X_train_processed)
146 | y_test_estimation = model.predict(X_test_processed)
147 |
148 | cm_train = confusion_matrix(y_train, y_train_estimation)
149 |
150 | cm_test = confusion_matrix(y_test, y_test_estimation)
151 |
152 | print('cm_train', cm_train)
153 | print('cm_test', cm_test)
154 |
155 | do_test('../data/cm_test.pkl', cm_test)
156 | do_test('../data/cm_train.pkl', cm_train)
157 | do_test('../data/X_train_processed.pkl', X_train_processed)
158 | do_test('../data/X_test_processed.pkl', X_test_processed)
159 |
160 | do_pandas_test('../data/df.pkl', df)
161 |
162 |
163 | def main(param: str = 'pass'):
164 | titanic_model_creator = TitanicModelCreator(
165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
166 | )
167 | titanic_model_creator.run()
168 |
169 |
170 | def test_main(param: str = 'pass'):
171 | titanic_model_creator = TitanicModelCreator(
172 | loader=TestLoader(
173 | passengers_filename='../data/passengers.pkl',
174 | targets_filename='../data/targets.pkl',
175 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
176 | )
177 | )
178 | titanic_model_creator.run()
179 |
180 |
181 | if __name__ == "__main__":
182 | typer.run(test_main)
183 |
--------------------------------------------------------------------------------
/Step09/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step10/README09.md:
--------------------------------------------------------------------------------
1 | ### Step 09: Merge passenger data with targets
2 |
3 | - Remove the `get_targets()` interface
4 | - Replace the query in `SqlLoader`
5 | - Remove any code related to `targets`
6 |
7 | This step prepares for building the "domain data model". The Titanic model is about the survival of the ship's passengers. For the code to align with this domain, the concept of a "passenger" needs to be introduced (as a class/object). A passenger either survived or not; survival is an attribute of the passenger and needs to be implemented as such.
8 |
9 | This is a critical part of the code quality journey and of building better systems. Once you introduce these concepts, your code will depend directly on the business problem you are solving, not on the various representations in which the data is stored (pandas, numpy, csv, etc.). A sketch of where this is heading follows the links below. I wrote about this many times on my blog:
10 |
11 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data)
12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
13 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses)
14 |
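15 | To make the point concrete, this is roughly the shape the next steps introduce: survival lives on the passenger object itself rather than in a separate `targets` structure. A preview, not the final class:
16 | 
17 | ```python
18 | from pydantic import BaseModel
19 | 
20 | 
21 | class Passenger(BaseModel):
22 |     pid: int
23 |     is_survived: int  # survival is an attribute of the passenger
24 | 
25 | 
26 | # Targets are then derived from the domain objects instead of being
27 | # loaded separately:
28 | # targets = [passenger.is_survived for passenger in passengers]
29 | ```
30 | 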
--------------------------------------------------------------------------------
/Step10/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step10/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step10/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = """
46 | SELECT
47 | tbl_passengers.*,
48 | tbl_targets.is_survived
49 | FROM
50 | tbl_passengers
51 | JOIN
52 | tbl_targets
53 | ON
54 | tbl_passengers.pid=tbl_targets.pid
55 | """
56 | return pd.read_sql(query, con=self.connection)
57 |
58 |
59 | class TestLoader:
60 | def __init__(self, passengers_filename, real_loader):
61 | self.passengers_filename = passengers_filename
62 | self.real_loader = real_loader
63 | if not os.path.isfile(self.passengers_filename):
64 | df = self.real_loader.get_passengers()
65 | df.to_pickle(self.passengers_filename)
66 |
67 | def get_passengers(self):
68 | return pd.read_pickle(self.passengers_filename)
69 |
70 |
71 | class TitanicModelCreator:
72 | def __init__(self, loader):
73 | self.loader = loader
74 | np.random.seed(42)
75 |
76 | def run(self):
77 | df = self.loader.get_passengers()
78 |
79 | # parch = Parents/Children, sibsp = Siblings/Spouses
80 | df['family_size'] = df['parch'] + df['sibsp']
81 | df['is_alone'] = [
82 | 1 if family_size == 1 else 0 for family_size in df['family_size']
83 | ]
84 |
85 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
86 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
87 | df['title'] = [
88 | 'rare' if title in rare_titles else title for title in df['title']
89 | ]
90 |
91 | targets = [int(v) for v in df['is_survived']]
92 | df = df[
93 | [
94 | 'pclass',
95 | 'sex',
96 | 'age',
97 | 'ticket',
98 | 'family_size',
99 | 'fare',
100 | 'embarked',
101 | 'is_alone',
102 | 'title',
103 | ]
104 | ]
105 |
106 | X_train, X_test, y_train, y_test = train_test_split(
107 | df, targets, stratify=targets, test_size=0.2
108 | )
109 |
110 | X_train_categorical = X_train[
111 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
112 | ]
113 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
114 |
115 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
116 | X_train_categorical
117 | )
118 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
119 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
120 |
121 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
122 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
123 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
124 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
125 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
126 |
127 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
128 | X_train_numerical_imputed_scaled = robust_scaler.transform(
129 | X_train_numerical_imputed
130 | )
131 | X_test_numerical_imputed_scaled = robust_scaler.transform(
132 | X_test_numerical_imputed
133 | )
134 |
135 | X_train_processed = np.hstack(
136 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
137 | )
138 | X_test_processed = np.hstack(
139 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
140 | )
141 |
142 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
143 | y_train_estimation = model.predict(X_train_processed)
144 | y_test_estimation = model.predict(X_test_processed)
145 |
146 | cm_train = confusion_matrix(y_train, y_train_estimation)
147 |
148 | cm_test = confusion_matrix(y_test, y_test_estimation)
149 |
150 | print('cm_train', cm_train)
151 | print('cm_test', cm_test)
152 |
153 | do_test('../data/cm_test.pkl', cm_test)
154 | do_test('../data/cm_train.pkl', cm_train)
155 | do_test('../data/X_train_processed.pkl', X_train_processed)
156 | do_test('../data/X_test_processed.pkl', X_test_processed)
157 |
158 | do_pandas_test('../data/df.pkl', df)
159 |
160 |
161 | def main(param: str = 'pass'):
162 | titanic_model_creator = TitanicModelCreator(
163 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
164 | )
165 | titanic_model_creator.run()
166 |
167 |
168 | def test_main(param: str = 'pass'):
169 | titanic_model_creator = TitanicModelCreator(
170 | loader=TestLoader(
171 | passengers_filename='../data/passengers_with_is_survived.pkl',
172 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
173 | )
174 | )
175 | titanic_model_creator.run()
176 |
177 |
178 | if __name__ == "__main__":
179 | typer.run(test_main)
180 |
--------------------------------------------------------------------------------
/Step10/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step11/README10.md:
--------------------------------------------------------------------------------
1 | ### Step 10: Create Passenger class
2 |
3 | - Import `BaseModel` from `pydantic`
4 | - Create the class by inspecting:
5 | - The `dtype` of columns used in `df`
6 | - The actual values in `df`
7 | - The names of the columns that are used later in the code
8 |
9 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you have to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic`'s validators, as sketched below. One goal of the refactoring is to make sure that most interaction between classes happens through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen.
10 |
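11 | A minimal sketch of such a validator, using pydantic's v1-style `@validator` (the repo does not pin a pydantic version; v2 renames this to `field_validator`). The allowed values come from the `df` inspection shown in the code comments:
12 | 
13 | ```python
14 | from pydantic import BaseModel, validator
15 | 
16 | 
17 | class Passenger(BaseModel):
18 |     pid: int
19 |     pclass: int
20 |     sex: str
21 |     age: float
22 | 
23 |     @validator('pclass')
24 |     def pclass_is_valid(cls, value):
25 |         # set(df['pclass']) == {1.0, 2.0, 3.0}, made explicit here
26 |         if value not in (1, 2, 3):
27 |             raise ValueError(f'unexpected pclass: {value}')
28 |         return value
29 | 
30 |     @validator('sex')
31 |     def sex_is_known(cls, value):
32 |         if value not in ('male', 'female'):
33 |             raise ValueError(f'unexpected sex: {value}')
34 |         return value
35 | ```
36 | 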
--------------------------------------------------------------------------------
/Step11/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step11/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step11/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | ticket: str
24 | family_size: int
25 | fare: float
26 | embarked: str
27 | is_alone: int
28 | title: str
29 | is_survived: int
30 |
31 |
32 | # targets = [int(v) for v in df['is_survived']]
33 | # df = df[[
34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size',
35 | # 'fare', 'embarked', 'is_alone', 'title',
36 | # ]]
37 |
38 | # >>> df[:3].T
39 | # 0 1 2
40 | # pid 0 1 2
41 | # pclass 1.0 1.0 1.0
42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss. Helen Loraine
43 | # sex female male female
44 | # age 29.0 0.9167 2.0
45 | # sibsp 0.0 1.0 1.0
46 | # parch 0.0 2.0 2.0
47 | # ticket 24160 113781 113781
48 | # fare 211.3375 151.55 151.55
49 | # cabin B5 C22 C26 C22 C26
50 | # embarked S S S
51 | # boat 2 11 None
52 | # body NaN NaN NaN
53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON
54 | # is_survived 1 1 0
55 | # >>> df.dtypes
56 | # pid int64
57 | # pclass float64
58 | # name object
59 | # sex object
60 | # age float64
61 | # sibsp float64
62 | # parch float64
63 | # ticket object
64 | # fare float64
65 | # cabin object
66 | # embarked object
67 | # boat object
68 | # body float64
69 | # home.dest object
70 | # is_survived int64
71 | # >>> set(df['pclass'])
72 | # {1.0, 2.0, 3.0}
73 |
74 |
75 | def do_test(filename, data):
76 | if not os.path.isfile(filename):
77 | pickle.dump(data, open(filename, 'wb'))
78 | truth = pickle.load(open(filename, 'rb'))
79 | try:
80 | np.testing.assert_almost_equal(data, truth)
81 | print(f'{filename} test passed')
82 | except AssertionError as ex:
83 | print(f'{filename} test failed {ex}')
84 |
85 |
86 | def do_pandas_test(filename, data):
87 | if not os.path.isfile(filename):
88 | data.to_pickle(filename)
89 | truth = pd.read_pickle(filename)
90 | try:
91 | pd.testing.assert_frame_equal(data, truth)
92 | print(f'{filename} pandas test passed')
93 | except AssertionError as ex:
94 | print(f'{filename} pandas test failed {ex}')
95 |
96 |
97 | class SqlLoader:
98 | def __init__(self, connection_string):
99 | engine = create_engine(connection_string)
100 | self.connection = engine.connect()
101 |
102 | def get_passengers(self):
103 | query = """
104 | SELECT
105 | tbl_passengers.*,
106 | tbl_targets.is_survived
107 | FROM
108 | tbl_passengers
109 | JOIN
110 | tbl_targets
111 | ON
112 | tbl_passengers.pid=tbl_targets.pid
113 | """
114 | return pd.read_sql(query, con=self.connection)
115 |
116 |
117 | class TestLoader:
118 | def __init__(self, passengers_filename, real_loader):
119 | self.passengers_filename = passengers_filename
120 | self.real_loader = real_loader
121 | if not os.path.isfile(self.passengers_filename):
122 | df = self.real_loader.get_passengers()
123 | df.to_pickle(self.passengers_filename)
124 |
125 | def get_passengers(self):
126 | return pd.read_pickle(self.passengers_filename)
127 |
128 |
129 | class TitanicModelCreator:
130 | def __init__(self, loader):
131 | self.loader = loader
132 | np.random.seed(42)
133 |
134 | def run(self):
135 | df = self.loader.get_passengers()
136 |
137 | # parch = Parents/Children, sibsp = Siblings/Spouses
138 | df['family_size'] = df['parch'] + df['sibsp']
139 | df['is_alone'] = [
140 | 1 if family_size == 1 else 0 for family_size in df['family_size']
141 | ]
142 |
143 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
144 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
145 | df['title'] = [
146 | 'rare' if title in rare_titles else title for title in df['title']
147 | ]
148 |
149 | targets = [int(v) for v in df['is_survived']]
150 | df = df[
151 | [
152 | 'pclass',
153 | 'sex',
154 | 'age',
155 | 'ticket',
156 | 'family_size',
157 | 'fare',
158 | 'embarked',
159 | 'is_alone',
160 | 'title',
161 | ]
162 | ]
163 |
164 | X_train, X_test, y_train, y_test = train_test_split(
165 | df, targets, stratify=targets, test_size=0.2
166 | )
167 |
168 | X_train_categorical = X_train[
169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
170 | ]
171 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
172 |
173 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
174 | X_train_categorical
175 | )
176 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
177 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
178 |
179 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
180 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
181 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
182 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
183 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
184 |
185 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
186 | X_train_numerical_imputed_scaled = robust_scaler.transform(
187 | X_train_numerical_imputed
188 | )
189 | X_test_numerical_imputed_scaled = robust_scaler.transform(
190 | X_test_numerical_imputed
191 | )
192 |
193 | X_train_processed = np.hstack(
194 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
195 | )
196 | X_test_processed = np.hstack(
197 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
198 | )
199 |
200 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
201 | y_train_estimation = model.predict(X_train_processed)
202 | y_test_estimation = model.predict(X_test_processed)
203 |
204 | cm_train = confusion_matrix(y_train, y_train_estimation)
205 |
206 | cm_test = confusion_matrix(y_test, y_test_estimation)
207 |
208 | print('cm_train', cm_train)
209 | print('cm_test', cm_test)
210 |
211 | do_test('../data/cm_test.pkl', cm_test)
212 | do_test('../data/cm_train.pkl', cm_train)
213 | do_test('../data/X_train_processed.pkl', X_train_processed)
214 | do_test('../data/X_test_processed.pkl', X_test_processed)
215 |
216 | do_pandas_test('../data/df.pkl', df)
217 |
218 |
219 | def main(param: str = 'pass'):
220 | titanic_model_creator = TitanicModelCreator(
221 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
222 | )
223 | titanic_model_creator.run()
224 |
225 |
226 | def test_main(param: str = 'pass'):
227 | titanic_model_creator = TitanicModelCreator(
228 | loader=TestLoader(
229 | passengers_filename='../data/passengers_with_is_survived.pkl',
230 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
231 | )
232 | )
233 | titanic_model_creator.run()
234 |
235 |
236 | if __name__ == "__main__":
237 | typer.run(test_main)
238 |
--------------------------------------------------------------------------------
/Step11/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step12/README11.md:
--------------------------------------------------------------------------------
1 | ### Step 11: Create domain data object based data loader
2 |
3 | - Create `PassengerLoader` class that takes a "real"/"old" loader
4 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects
5 | - Copy the data transformations from `TitanicModelCreator.run()`
6 |
7 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear fewer than 10 times are selected. This can only be done if you have access to the entire dataset, and the resulting list needs to be maintained. That can cause problems in a real setting where such a scan is too difficult, for example if you have millions of items or a constant stream. These kinds of dependencies are common in legacy code, and one of the goals of refactoring is to identify them and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service.
8 |
9 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other objects; they are a type of adapter that hides where the data comes from and how it is stored, returning only abstract, domain-relevant objects that you can use downstream. A sketch of the constant-based construction follows the links below. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows:
10 |
11 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to)
12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
13 |
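14 | A sketch of the constant-based construction mentioned above. The `RARE_TITLES` values are illustrative only; derive them once, offline, with the same `Counter` logic that `run()` uses inline:
15 | 
16 | ```python
17 | from collections import Counter
18 | 
19 | from titanic_model import PassengerLoader, SqlLoader
20 | 
21 | 
22 | def derive_rare_titles(titles, min_count=10):
23 |     """One-off, offline derivation of the maintained constant."""
24 |     return {title for title, count in Counter(titles).items() if count < min_count}
25 | 
26 | 
27 | # Frozen result of the derivation above (values illustrative only):
28 | RARE_TITLES = {'Dr', 'Rev', 'Col', 'Major', 'Ms'}
29 | 
30 | passengers = PassengerLoader(
31 |     loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
32 |     rare_titles=RARE_TITLES,
33 | ).get_passengers()
34 | ```
35 | 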
--------------------------------------------------------------------------------
/Step12/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step12/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step12/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | ticket: str
24 | family_size: int
25 | fare: float
26 | embarked: str
27 | is_alone: int
28 | title: str
29 | is_survived: int
30 |
31 |
32 | # targets = [int(v) for v in df['is_survived']]
33 | # df = df[[
34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size',
35 | # 'fare', 'embarked', 'is_alone', 'title',
36 | # ]]
37 |
38 | # >>> df[:3].T
39 | # 0 1 2
40 | # pid 0 1 2
41 | # pclass 1.0 1.0 1.0
42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss. Helen Loraine
43 | # sex female male female
44 | # age 29.0 0.9167 2.0
45 | # sibsp 0.0 1.0 1.0
46 | # parch 0.0 2.0 2.0
47 | # ticket 24160 113781 113781
48 | # fare 211.3375 151.55 151.55
49 | # cabin B5 C22 C26 C22 C26
50 | # embarked S S S
51 | # boat 2 11 None
52 | # body NaN NaN NaN
53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON
54 | # is_survived 1 1 0
55 | # >>> df.dtypes
56 | # pid int64
57 | # pclass float64
58 | # name object
59 | # sex object
60 | # age float64
61 | # sibsp float64
62 | # parch float64
63 | # ticket object
64 | # fare float64
65 | # cabin object
66 | # embarked object
67 | # boat object
68 | # body float64
69 | # home.dest object
70 | # is_survived int64
71 | # >>> set(df['pclass'])
72 | # {1.0, 2.0, 3.0}
73 |
74 |
75 | def do_test(filename, data):
76 | if not os.path.isfile(filename):
77 | pickle.dump(data, open(filename, 'wb'))
78 | truth = pickle.load(open(filename, 'rb'))
79 | try:
80 | np.testing.assert_almost_equal(data, truth)
81 | print(f'{filename} test passed')
82 | except AssertionError as ex:
83 | print(f'{filename} test failed {ex}')
84 |
85 |
86 | def do_pandas_test(filename, data):
87 | if not os.path.isfile(filename):
88 | data.to_pickle(filename)
89 | truth = pd.read_pickle(filename)
90 | try:
91 | pd.testing.assert_frame_equal(data, truth)
92 | print(f'{filename} pandas test passed')
93 | except AssertionError as ex:
94 | print(f'{filename} pandas test failed {ex}')
95 |
96 |
97 | class SqlLoader:
98 | def __init__(self, connection_string):
99 | engine = create_engine(connection_string)
100 | self.connection = engine.connect()
101 |
102 | def get_passengers(self):
103 | query = """
104 | SELECT
105 | tbl_passengers.*,
106 | tbl_targets.is_survived
107 | FROM
108 | tbl_passengers
109 | JOIN
110 | tbl_targets
111 | ON
112 | tbl_passengers.pid=tbl_targets.pid
113 | """
114 | return pd.read_sql(query, con=self.connection)
115 |
116 |
117 | class TestLoader:
118 | def __init__(self, passengers_filename, real_loader):
119 | self.passengers_filename = passengers_filename
120 | self.real_loader = real_loader
121 | if not os.path.isfile(self.passengers_filename):
122 | df = self.real_loader.get_passengers()
123 | df.to_pickle(self.passengers_filename)
124 |
125 | def get_passengers(self):
126 | return pd.read_pickle(self.passengers_filename)
127 |
128 |
129 | class PassengerLoader:
130 | def __init__(self, loader, rare_titles=None):
131 | self.loader = loader
132 | self.rare_titles = rare_titles
133 |
134 | def get_passengers(self):
135 | passengers = []
136 | for data in self.loader.get_passengers().itertuples():
137 | # parch = Parents/Children, sibsp = Siblings/Spouses
138 | family_size = int(data.parch + data.sibsp)
139 | # Allen, Miss. Elisabeth Walton
140 | title = data.name.split(',')[1].split('.')[0].strip()
141 | passenger = Passenger(
142 | pid=int(data.pid),
143 | pclass=int(data.pclass),
144 | sex=str(data.sex),
145 | age=float(data.age),
146 | ticket=str(data.ticket),
147 | family_size=family_size,
148 | fare=float(data.fare),
149 | embarked=str(data.embarked),
150 | is_alone=1 if family_size == 1 else 0,
151 | title='rare' if title in self.rare_titles else title,
152 | is_survived=int(data.is_survived),
153 | )
154 | passengers.append(passenger)
155 | return passengers
156 |
157 |
158 | # Not used:
159 | # cabin object
160 | # boat object
161 | # body float64
162 | # home.dest object
163 |
164 |
165 | class TitanicModelCreator:
166 | def __init__(self, loader):
167 | self.loader = loader
168 | np.random.seed(42)
169 |
170 | def run(self):
171 | df = self.loader.get_passengers()
172 |
173 | # parch = Parents/Children, sibsp = Siblings/Spouses
174 | df['family_size'] = df['parch'] + df['sibsp']
175 | df['is_alone'] = [
176 | 1 if family_size == 1 else 0 for family_size in df['family_size']
177 | ]
178 |
179 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
180 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
181 | df['title'] = [
182 | 'rare' if title in rare_titles else title for title in df['title']
183 | ]
184 |
185 | targets = [int(v) for v in df['is_survived']]
186 | df = df[
187 | [
188 | 'pclass',
189 | 'sex',
190 | 'age',
191 | 'ticket',
192 | 'family_size',
193 | 'fare',
194 | 'embarked',
195 | 'is_alone',
196 | 'title',
197 | ]
198 | ]
199 |
200 | X_train, X_test, y_train, y_test = train_test_split(
201 | df, targets, stratify=targets, test_size=0.2
202 | )
203 |
204 | X_train_categorical = X_train[
205 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
206 | ]
207 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
208 |
209 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
210 | X_train_categorical
211 | )
212 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
213 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
214 |
215 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
216 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
217 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
218 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
219 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
220 |
221 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
222 | X_train_numerical_imputed_scaled = robust_scaler.transform(
223 | X_train_numerical_imputed
224 | )
225 | X_test_numerical_imputed_scaled = robust_scaler.transform(
226 | X_test_numerical_imputed
227 | )
228 |
229 | X_train_processed = np.hstack(
230 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
231 | )
232 | X_test_processed = np.hstack(
233 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
234 | )
235 |
236 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
237 | y_train_estimation = model.predict(X_train_processed)
238 | y_test_estimation = model.predict(X_test_processed)
239 |
240 | cm_train = confusion_matrix(y_train, y_train_estimation)
241 |
242 | cm_test = confusion_matrix(y_test, y_test_estimation)
243 |
244 | print('cm_train', cm_train)
245 | print('cm_test', cm_test)
246 |
247 | do_test('../data/cm_test.pkl', cm_test)
248 | do_test('../data/cm_train.pkl', cm_train)
249 | do_test('../data/X_train_processed.pkl', X_train_processed)
250 | do_test('../data/X_test_processed.pkl', X_test_processed)
251 |
252 | do_pandas_test('../data/df.pkl', df)
253 |
254 |
255 | def main(param: str = 'pass'):
256 | titanic_model_creator = TitanicModelCreator(
257 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
258 | )
259 | titanic_model_creator.run()
260 |
261 |
262 | def test_main(param: str = 'pass'):
263 | titanic_model_creator = TitanicModelCreator(
264 | loader=TestLoader(
265 | passengers_filename='../data/passengers_with_is_survived.pkl',
266 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
267 | )
268 | )
269 | titanic_model_creator.run()
270 |
271 |
272 | if __name__ == "__main__":
273 | typer.run(test_main)
274 |
--------------------------------------------------------------------------------
/Step12/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step13/README12.md:
--------------------------------------------------------------------------------
1 | ### Step 12: Remove any data that is not explicitly needed
2 |
3 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input
4 |
5 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed; if the requirements change, it can be added back again. For example, the `ticket` column is in `df` but it is never used again in the program. Remove it.
6 |
--------------------------------------------------------------------------------
/Step13/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step13/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step13/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | family_size: int
24 | fare: float
25 | embarked: str
26 | is_alone: int
27 | title: str
28 | is_survived: int
29 |
30 |
31 | def do_test(filename, data):
32 | if not os.path.isfile(filename):
33 | pickle.dump(data, open(filename, 'wb'))
34 | truth = pickle.load(open(filename, 'rb'))
35 | try:
36 | np.testing.assert_almost_equal(data, truth)
37 | print(f'{filename} test passed')
38 | except AssertionError as ex:
39 | print(f'{filename} test failed {ex}')
40 |
41 |
42 | def do_pandas_test(filename, data):
43 | if not os.path.isfile(filename):
44 | data.to_pickle(filename)
45 | truth = pd.read_pickle(filename)
46 | try:
47 | pd.testing.assert_frame_equal(data, truth)
48 | print(f'{filename} pandas test passed')
49 | except AssertionError as ex:
50 | print(f'{filename} pandas test failed {ex}')
51 |
52 |
53 | class SqlLoader:
54 | def __init__(self, connection_string):
55 | engine = create_engine(connection_string)
56 | self.connection = engine.connect()
57 |
58 | def get_passengers(self):
59 | query = """
60 | SELECT
61 | tbl_passengers.pid,
62 | tbl_passengers.pclass,
63 | tbl_passengers.sex,
64 | tbl_passengers.age,
65 | tbl_passengers.parch,
66 | tbl_passengers.sibsp,
67 | tbl_passengers.fare,
68 | tbl_passengers.embarked,
69 | tbl_passengers.name,
70 | tbl_targets.is_survived
71 | FROM
72 | tbl_passengers
73 | JOIN
74 | tbl_targets
75 | ON
76 | tbl_passengers.pid=tbl_targets.pid
77 | """
78 | return pd.read_sql(query, con=self.connection)
79 |
80 |
81 | class TestLoader:
82 | def __init__(self, passengers_filename, real_loader):
83 | self.passengers_filename = passengers_filename
84 | self.real_loader = real_loader
85 | if not os.path.isfile(self.passengers_filename):
86 | df = self.real_loader.get_passengers()
87 | df.to_pickle(self.passengers_filename)
88 |
89 | def get_passengers(self):
90 | return pd.read_pickle(self.passengers_filename)
91 |
92 |
93 | class PassengerLoader:
94 | def __init__(self, loader, rare_titles=None):
95 | self.loader = loader
96 | self.rare_titles = rare_titles
97 |
98 | def get_passengers(self):
99 | passengers = []
100 | for data in self.loader.get_passengers().itertuples():
101 | # parch = Parents/Children, sibsp = Siblings/Spouses
102 | family_size = int(data.parch + data.sibsp)
103 | # Allen, Miss. Elisabeth Walton
104 | title = data.name.split(',')[1].split('.')[0].strip()
105 | passenger = Passenger(
106 | pid=int(data.pid),
107 | pclass=int(data.pclass),
108 | sex=str(data.sex),
109 | age=float(data.age),
110 | family_size=family_size,
111 | fare=float(data.fare),
112 | embarked=str(data.embarked),
113 | is_alone=1 if family_size == 1 else 0,
114 | title='rare' if title in self.rare_titles else title,
115 | is_survived=int(data.is_survived),
116 | )
117 | passengers.append(passenger)
118 | return passengers
119 |
120 |
121 | class TitanicModelCreator:
122 | def __init__(self, loader):
123 | self.loader = loader
124 | np.random.seed(42)
125 |
126 | def run(self):
127 | df = self.loader.get_passengers()
128 |
129 | # parch = Parents/Children, sibsp = Siblings/Spouses
130 | df['family_size'] = df['parch'] + df['sibsp']
131 | df['is_alone'] = [
132 | 1 if family_size == 1 else 0 for family_size in df['family_size']
133 | ]
134 |
135 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
136 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
137 | df['title'] = [
138 | 'rare' if title in rare_titles else title for title in df['title']
139 | ]
140 |
141 | targets = [int(v) for v in df['is_survived']]
142 | df = df[
143 | [
144 | 'pclass',
145 | 'sex',
146 | 'age',
147 | 'family_size',
148 | 'fare',
149 | 'embarked',
150 | 'is_alone',
151 | 'title',
152 | ]
153 | ]
154 |
155 | X_train, X_test, y_train, y_test = train_test_split(
156 | df, targets, stratify=targets, test_size=0.2
157 | )
158 |
159 | X_train_categorical = X_train[
160 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
161 | ]
162 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
163 |
164 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
165 | X_train_categorical
166 | )
167 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
168 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
169 |
170 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
171 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
172 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
173 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
174 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
175 |
176 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
177 | X_train_numerical_imputed_scaled = robust_scaler.transform(
178 | X_train_numerical_imputed
179 | )
180 | X_test_numerical_imputed_scaled = robust_scaler.transform(
181 | X_test_numerical_imputed
182 | )
183 |
184 | X_train_processed = np.hstack(
185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
186 | )
187 | X_test_processed = np.hstack(
188 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
189 | )
190 |
191 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
192 | y_train_estimation = model.predict(X_train_processed)
193 | y_test_estimation = model.predict(X_test_processed)
194 |
195 | cm_train = confusion_matrix(y_train, y_train_estimation)
196 |
197 | cm_test = confusion_matrix(y_test, y_test_estimation)
198 |
199 | print('cm_train', cm_train)
200 | print('cm_test', cm_test)
201 |
202 | do_test('../data/cm_test.pkl', cm_test)
203 | do_test('../data/cm_train.pkl', cm_train)
204 | do_test('../data/X_train_processed.pkl', X_train_processed)
205 | do_test('../data/X_test_processed.pkl', X_test_processed)
206 |
207 | do_pandas_test('../data/df.pkl', df)
208 |
209 |
210 | def main(param: str = 'pass'):
211 | titanic_model_creator = TitanicModelCreator(
212 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
213 | )
214 | titanic_model_creator.run()
215 |
216 |
217 | def test_main(param: str = 'pass'):
218 | titanic_model_creator = TitanicModelCreator(
219 | loader=TestLoader(
220 | passengers_filename='../data/passengers_with_is_survived.pkl',
221 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
222 | )
223 | )
224 | titanic_model_creator.run()
225 |
226 |
227 | if __name__ == "__main__":
228 | typer.run(test_main)
229 |
--------------------------------------------------------------------------------
/Step13/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step14/README13.md:
--------------------------------------------------------------------------------
1 | ### Step 13: Use Passenger objects in the program
2 |
3 | - Add `PassengerLoader` to `main` and `test_main`
4 | - Add the `RARE_TITLES` constant
5 | - Convert the classes back into the `df` dataframe with `passenger.dict()`
6 |
7 | It is very important to refactor incrementally. Any change should be small enough that, if the tests fail, the source of the failure can be found quickly. So for now we stop at using the new loader and do not change anything else.
8 |
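9 | A condensed sketch of the wiring (names match this step's file below); the typed `Passenger` objects are flattened straight back into the dataframe, so the rest of `run()` stays untouched for this increment:
10 |
11 | ```python
12 | # In main(): wrap the existing loader, then in run() convert back to a dataframe.
13 | loader = PassengerLoader(
14 |     loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
15 |     rare_titles=RARE_TITLES,
16 | )
17 | df = pd.DataFrame([passenger.dict() for passenger in loader.get_passengers()])
18 | ```
19 |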
--------------------------------------------------------------------------------
/Step14/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step14/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step14/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModelCreator:
139 | def __init__(self, loader):
140 | self.loader = loader
141 | np.random.seed(42)
142 |
143 | def run(self):
144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
145 | targets = [int(v) for v in df['is_survived']]
146 |
147 | X_train, X_test, y_train, y_test = train_test_split(
148 | df, targets, stratify=targets, test_size=0.2
149 | )
150 |
151 | X_train_categorical = X_train[
152 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
153 | ]
154 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
155 |
156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
157 | X_train_categorical
158 | )
159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
160 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
161 |
162 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
163 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
164 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
165 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
166 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
167 |
168 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
169 | X_train_numerical_imputed_scaled = robust_scaler.transform(
170 | X_train_numerical_imputed
171 | )
172 | X_test_numerical_imputed_scaled = robust_scaler.transform(
173 | X_test_numerical_imputed
174 | )
175 |
176 | X_train_processed = np.hstack(
177 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
178 | )
179 | X_test_processed = np.hstack(
180 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
181 | )
182 |
183 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
184 | y_train_estimation = model.predict(X_train_processed)
185 | y_test_estimation = model.predict(X_test_processed)
186 |
187 | cm_train = confusion_matrix(y_train, y_train_estimation)
188 |
189 | cm_test = confusion_matrix(y_test, y_test_estimation)
190 |
191 | print('cm_train', cm_train)
192 | print('cm_test', cm_test)
193 |
194 | do_test('../data/cm_test.pkl', cm_test)
195 | do_test('../data/cm_train.pkl', cm_train)
196 | do_test('../data/X_train_processed.pkl', X_train_processed)
197 | do_test('../data/X_test_processed.pkl', X_test_processed)
198 |
199 | do_pandas_test('../data/df_no_tickets.pkl', df)
200 |
201 |
202 | def main(param: str = 'pass'):
203 | titanic_model_creator = TitanicModelCreator(
204 | loader=PassengerLoader(
205 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
206 | rare_titles=RARE_TITLES,
207 | )
208 | )
209 | titanic_model_creator.run()
210 |
211 |
212 | def test_main(param: str = 'pass'):
213 | titanic_model_creator = TitanicModelCreator(
214 | loader=PassengerLoader(
215 | loader=TestLoader(
216 | passengers_filename='../data/passengers_with_is_survived.pkl',
217 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
218 | ),
219 | rare_titles=RARE_TITLES,
220 | )
221 | )
222 | titanic_model_creator.run()
223 |
224 |
225 | if __name__ == "__main__":
226 | typer.run(test_main)
227 |
--------------------------------------------------------------------------------
/Step14/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step15/README14.md:
--------------------------------------------------------------------------------
1 | ### Step 14: Separate training and evaluation functions
2 |
3 | - Move all code related to evaluation (variables that have `_test_` in their names) into one group
4 |
5 | After the model is created, it is first trained, then evaluated on the training data, and finally evaluated on the testing data. These stages should be separated into their own logical blocks. This prepares for moving them into actually separate places.
6 |
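7 | A minimal, self-contained illustration of the invariant behind this grouping (toy data, not from the repo): every `fit` call belongs in the training block, while the testing block only ever calls `transform` on the already-fitted objects:
8 |
9 | ```python
10 | import numpy as np
11 | from sklearn.preprocessing import RobustScaler
12 |
13 | X_train = np.array([[1.0], [2.0], [3.0], [100.0]])
14 | X_test = np.array([[2.5]])
15 |
16 | # --- TRAINING --- fitting happens here, and only here
17 | robust_scaler = RobustScaler().fit(X_train)
18 | X_train_scaled = robust_scaler.transform(X_train)
19 |
20 | # --- TESTING --- reuse the fitted scaler, never refit on evaluation data
21 | X_test_scaled = robust_scaler.transform(X_test)
22 | ```
23 |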
--------------------------------------------------------------------------------
/Step15/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step15/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step15/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModelCreator:
139 | def __init__(self, loader):
140 | self.loader = loader
141 | np.random.seed(42)
142 |
143 | def run(self):
144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
145 | targets = [int(v) for v in df['is_survived']]
146 |
147 | X_train, X_test, y_train, y_test = train_test_split(
148 | df, targets, stratify=targets, test_size=0.2
149 | )
150 |
151 | # --- TRAINING ---
152 | X_train_categorical = X_train[
153 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
154 | ]
155 |
156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
157 | X_train_categorical
158 | )
159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
160 |
161 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
162 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
163 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
164 |
165 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
166 | X_train_numerical_imputed_scaled = robust_scaler.transform(
167 | X_train_numerical_imputed
168 | )
169 |
170 | X_train_processed = np.hstack(
171 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
172 | )
173 |
174 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
175 | y_train_estimation = model.predict(X_train_processed)
176 |
177 | cm_train = confusion_matrix(y_train, y_train_estimation)
178 |
179 | # --- TESTING ---
180 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
181 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
182 |
183 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
184 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
185 | X_test_numerical_imputed_scaled = robust_scaler.transform(
186 | X_test_numerical_imputed
187 | )
188 |
189 | X_test_processed = np.hstack(
190 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
191 | )
192 |
193 | y_test_estimation = model.predict(X_test_processed)
194 | cm_test = confusion_matrix(y_test, y_test_estimation)
195 |
196 | print('cm_train', cm_train)
197 | print('cm_test', cm_test)
198 |
199 | do_test('../data/cm_test.pkl', cm_test)
200 | do_test('../data/cm_train.pkl', cm_train)
201 | do_test('../data/X_train_processed.pkl', X_train_processed)
202 | do_test('../data/X_test_processed.pkl', X_test_processed)
203 |
204 | do_pandas_test('../data/df_no_tickets.pkl', df)
205 |
206 |
207 | def main(param: str = 'pass'):
208 | titanic_model_creator = TitanicModelCreator(
209 | loader=PassengerLoader(
210 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
211 | rare_titles=RARE_TITLES,
212 | )
213 | )
214 | titanic_model_creator.run()
215 |
216 |
217 | def test_main(param: str = 'pass'):
218 | titanic_model_creator = TitanicModelCreator(
219 | loader=PassengerLoader(
220 | loader=TestLoader(
221 | passengers_filename='../data/passengers_with_is_survived.pkl',
222 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
223 | ),
224 | rare_titles=RARE_TITLES,
225 | )
226 | )
227 | titanic_model_creator.run()
228 |
229 |
230 | if __name__ == "__main__":
231 | typer.run(test_main)
232 |
--------------------------------------------------------------------------------
/Step15/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step16/README15.md:
--------------------------------------------------------------------------------
1 | ### Step 15: Create `TitanicModel` class
2 |
3 | - Create a class that has all the `sklearn` components as member variables
4 | - Instantiate these before the "Training" block
5 | - Use these instead of the local ones
6 |
7 | The goal of the whole program is to create a model; despite this, until now there was no single object representing that model. The next steps establish the concept of this model and the services it provides to `TitanicModelCreator`.
8 |
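9 | The new class, excerpted from the file below; the `sklearn` components become member variables, while `train()` and `estimate()` are stubs to be filled in over the next steps:
10 |
11 | ```python
12 | class TitanicModel:
13 |     def __init__(self):
14 |         self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
15 |         self.knn_imputer = KNNImputer(n_neighbors=5)
16 |         self.robust_scaler = RobustScaler()
17 |         self.predictor = LogisticRegression(random_state=0)
18 | ```
19 |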
--------------------------------------------------------------------------------
/Step16/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step16/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step16/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
141 | self.knn_imputer = KNNImputer(n_neighbors=5)
142 | self.robust_scaler = RobustScaler()
143 | self.predictor = LogisticRegression(random_state=0)
144 |
145 | def train(self):
146 | pass
147 |
148 | def estimate(self, passengers):
149 | return 1
150 |
151 |
152 | class TitanicModelCreator:
153 | def __init__(self, loader):
154 | self.loader = loader
155 | np.random.seed(42)
156 |
157 | def run(self):
158 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
159 | targets = [int(v) for v in df['is_survived']]
160 |
161 | X_train, X_test, y_train, y_test = train_test_split(
162 | df, targets, stratify=targets, test_size=0.2
163 | )
164 |
165 | # --- TRAINING ---
166 | model = TitanicModel()
167 |
168 | X_train_categorical = X_train[
169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
170 | ]
171 |
172 | model.one_hot_encoder.fit(X_train_categorical)
173 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical)
174 |
175 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
176 | model.knn_imputer.fit(X_train_numerical)
177 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical)
178 |
179 | model.robust_scaler.fit(X_train_numerical_imputed)
180 | X_train_numerical_imputed_scaled = model.robust_scaler.transform(
181 | X_train_numerical_imputed
182 | )
183 |
184 | X_train_processed = np.hstack(
185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
186 | )
187 |
188 | model.predictor.fit(X_train_processed, y_train)
189 | y_train_estimation = model.predictor.predict(X_train_processed)
190 |
191 | cm_train = confusion_matrix(y_train, y_train_estimation)
192 |
193 | # --- TESTING ---
194 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
195 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
196 |
197 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
198 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
199 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
200 | X_test_numerical_imputed
201 | )
202 |
203 | X_test_processed = np.hstack(
204 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
205 | )
206 |
207 | y_test_estimation = model.predictor.predict(X_test_processed)
208 | cm_test = confusion_matrix(y_test, y_test_estimation)
209 |
210 | print('cm_train', cm_train)
211 | print('cm_test', cm_test)
212 |
213 | do_test('../data/cm_test.pkl', cm_test)
214 | do_test('../data/cm_train.pkl', cm_train)
215 | do_test('../data/X_train_processed.pkl', X_train_processed)
216 | do_test('../data/X_test_processed.pkl', X_test_processed)
217 |
218 | do_pandas_test('../data/df_no_tickets.pkl', df)
219 |
220 |
221 | def main(param: str = 'pass'):
222 | titanic_model_creator = TitanicModelCreator(
223 | loader=PassengerLoader(
224 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
225 | rare_titles=RARE_TITLES,
226 | )
227 | )
228 | titanic_model_creator.run()
229 |
230 |
231 | def test_main(param: str = 'pass'):
232 | titanic_model_creator = TitanicModelCreator(
233 | loader=PassengerLoader(
234 | loader=TestLoader(
235 | passengers_filename='../data/passengers_with_is_survived.pkl',
236 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
237 | ),
238 | rare_titles=RARE_TITLES,
239 | )
240 | )
241 | titanic_model_creator.run()
242 |
243 |
244 | if __name__ == "__main__":
245 | typer.run(test_main)
246 |
--------------------------------------------------------------------------------
/Step16/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step17/README16.md:
--------------------------------------------------------------------------------
1 | ### Step 16: Passenger class based training and evaluation sets
2 |
3 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely whether the passenger survived or not)
4 | - Refactor `X_train/X_test` to be created from these lists of passengers
5 |
6 | Because `train_test_split` works on lists, we extract the pids and the targets from the `Passenger` objects, split the pids, and recreate the two sets through a mapping from pids back to the objects, as sketched below.
7 |
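8 | The split, condensed from the `split_passengers` method in the file below:
9 |
10 | ```python
11 | def split_passengers(self, passengers):
12 |     passengers_map = {p.pid: p for p in passengers}
13 |     pids = [passenger.pid for passenger in passengers]
14 |     targets = [passenger.is_survived for passenger in passengers]
15 |     # split the plain pid list, stratified on survival, then map back to objects
16 |     train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
17 |     return (
18 |         [passengers_map[pid] for pid in train_pids],
19 |         [passengers_map[pid] for pid in test_pids],
20 |     )
21 | ```
22 |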
--------------------------------------------------------------------------------
/Step17/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step17/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step17/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
141 | self.knn_imputer = KNNImputer(n_neighbors=5)
142 | self.robust_scaler = RobustScaler()
143 | self.predictor = LogisticRegression(random_state=0)
144 |
145 | def train(self):
146 | pass
147 |
148 | def estimate(self, passengers):
149 | return 1
150 |
151 |
152 | class TitanicModelCreator:
153 | def __init__(self, loader):
154 | self.loader = loader
155 | np.random.seed(42)
156 |
157 | def split_passengers(self, passengers):
158 | passengers_map = {p.pid: p for p in passengers}
159 | pids = [passenger.pid for passenger in passengers]
160 | targets = [passenger.is_survived for passenger in passengers]
161 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
162 | train_passengers = [passengers_map[pid] for pid in train_pids]
163 | test_passengers = [passengers_map[pid] for pid in test_pids]
164 | return train_passengers, test_passengers
165 |
166 | def run(self):
167 | passengers = self.loader.get_passengers()
168 | train_passengers, test_passengers = self.split_passengers(passengers)
169 |
170 | X_train = pd.DataFrame([v.dict() for v in train_passengers])
171 | y_train = [v.is_survived for v in train_passengers]
172 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
173 | y_test = [v.is_survived for v in test_passengers]
174 |
175 | # --- TRAINING ---
176 | model = TitanicModel()
177 |
178 | X_train_categorical = X_train[
179 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
180 | ]
181 |
182 | model.one_hot_encoder.fit(X_train_categorical)
183 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical)
184 |
185 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
186 | model.knn_imputer.fit(X_train_numerical)
187 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical)
188 |
189 | model.robust_scaler.fit(X_train_numerical_imputed)
190 | X_train_numerical_imputed_scaled = model.robust_scaler.transform(
191 | X_train_numerical_imputed
192 | )
193 |
194 | X_train_processed = np.hstack(
195 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
196 | )
197 |
198 | model.predictor.fit(X_train_processed, y_train)
199 | y_train_estimation = model.predictor.predict(X_train_processed)
200 |
201 | cm_train = confusion_matrix(y_train, y_train_estimation)
202 |
203 | # --- TESTING ---
204 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
205 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
206 |
207 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
208 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
209 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
210 | X_test_numerical_imputed
211 | )
212 |
213 | X_test_processed = np.hstack(
214 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
215 | )
216 |
217 | y_test_estimation = model.predictor.predict(X_test_processed)
218 | cm_test = confusion_matrix(y_test, y_test_estimation)
219 |
220 | print('cm_train', cm_train)
221 | print('cm_test', cm_test)
222 |
223 | do_test('../data/cm_test.pkl', cm_test)
224 | do_test('../data/cm_train.pkl', cm_train)
225 | do_test('../data/X_train_processed.pkl', X_train_processed)
226 | do_test('../data/X_test_processed.pkl', X_test_processed)
227 |
228 | do_pandas_test(
229 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
230 | )
231 |
232 |
233 | def main(param: str = 'pass'):
234 | titanic_model_creator = TitanicModelCreator(
235 | loader=PassengerLoader(
236 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
237 | rare_titles=RARE_TITLES,
238 | )
239 | )
240 | titanic_model_creator.run()
241 |
242 |
243 | def test_main(param: str = 'pass'):
244 | titanic_model_creator = TitanicModelCreator(
245 | loader=PassengerLoader(
246 | loader=TestLoader(
247 | passengers_filename='../data/passengers_with_is_survived.pkl',
248 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
249 | ),
250 | rare_titles=RARE_TITLES,
251 | )
252 | )
253 | titanic_model_creator.run()
254 |
255 |
256 | if __name__ == "__main__":
257 | typer.run(test_main)
258 |
--------------------------------------------------------------------------------
/Step17/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step18/README17.md:
--------------------------------------------------------------------------------
1 | ### Step 17: Create input processing for `TitanicModel`
2 |
3 | - Move code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`.
4 | - Introduce `self.trained` boolean
5 | - Based on `self.trained` either call the `transform` or `fit_transform` of the `sklearn` input processor functions
6 |
7 | All the input transformation code happens twice: once for the training data and once for the evaluation data. Yet transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies functionality that belongs in `TitanicModel`. It will take several steps to resolve this; the resulting code will be a self-contained model that can be shipped independently of its creator.
8 |
9 |
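10 | The resulting method, excerpted from the file below; `self.trained` selects between the fit-and-transform path used on the first (training) call and the transform-only path used afterwards:
11 |
12 | ```python
13 | def process_inputs(self, passengers):
14 |     data = pd.DataFrame([v.dict() for v in passengers])
15 |     categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
16 |     numerical_data = data[['age', 'fare', 'family_size']]
17 |     if self.trained:
18 |         # inference: reuse the already-fitted transformers
19 |         categorical_data = self.one_hot_encoder.transform(categorical_data)
20 |         numerical_data = self.robust_scaler.transform(
21 |             self.knn_imputer.transform(numerical_data)
22 |         )
23 |     else:
24 |         # first call, during training: fit and transform in one go
25 |         categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
26 |         numerical_data = self.robust_scaler.fit_transform(
27 |             self.knn_imputer.fit_transform(numerical_data)
28 |         )
29 |     return np.hstack((categorical_data, numerical_data))
30 | ```
31 |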
--------------------------------------------------------------------------------
/Step18/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step18/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step18/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self):
163 | pass
164 |
165 | def estimate(self, passengers):
166 | return 1
167 |
168 |
169 | class TitanicModelCreator:
170 | def __init__(self, loader):
171 | self.loader = loader
172 | np.random.seed(42)
173 |
174 | def split_passengers(self, passengers):
175 | passengers_map = {p.pid: p for p in passengers}
176 | pids = [passenger.pid for passenger in passengers]
177 | targets = [passenger.is_survived for passenger in passengers]
178 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
179 | train_passengers = [passengers_map[pid] for pid in train_pids]
180 | test_passengers = [passengers_map[pid] for pid in test_pids]
181 | return train_passengers, test_passengers
182 |
183 | def run(self):
184 | passengers = self.loader.get_passengers()
185 | train_passengers, test_passengers = self.split_passengers(passengers)
186 |
187 | y_train = [v.is_survived for v in train_passengers]
188 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
189 | y_test = [v.is_survived for v in test_passengers]
190 |
191 | # --- TRAINING ---
192 | model = TitanicModel()
193 |
194 | X_train_processed = model.process_inputs(train_passengers)
195 | model.predictor.fit(X_train_processed, y_train)
196 | y_train_estimation = model.predictor.predict(X_train_processed)
197 |
198 | cm_train = confusion_matrix(y_train, y_train_estimation)
199 |
200 | # --- TESTING ---
201 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
202 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
203 |
204 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
205 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
206 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
207 | X_test_numerical_imputed
208 | )
209 |
210 | X_test_processed = np.hstack(
211 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
212 | )
213 |
214 | y_test_estimation = model.predictor.predict(X_test_processed)
215 | cm_test = confusion_matrix(y_test, y_test_estimation)
216 |
217 | print('cm_train', cm_train)
218 | print('cm_test', cm_test)
219 |
220 | do_test('../data/cm_test.pkl', cm_test)
221 | do_test('../data/cm_train.pkl', cm_train)
222 | do_test('../data/X_train_processed.pkl', X_train_processed)
223 | do_test('../data/X_test_processed.pkl', X_test_processed)
224 |
225 | do_pandas_test(
226 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
227 | )
228 |
229 |
230 | def main(param: str = 'pass'):
231 | titanic_model_creator = TitanicModelCreator(
232 | loader=PassengerLoader(
233 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
234 | rare_titles=RARE_TITLES,
235 | )
236 | )
237 | titanic_model_creator.run()
238 |
239 |
240 | def test_main(param: str = 'pass'):
241 | titanic_model_creator = TitanicModelCreator(
242 | loader=PassengerLoader(
243 | loader=TestLoader(
244 | passengers_filename='../data/passengers_with_is_survived.pkl',
245 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
246 | ),
247 | rare_titles=RARE_TITLES,
248 | )
249 | )
250 | titanic_model_creator.run()
251 |
252 |
253 | if __name__ == "__main__":
254 | typer.run(test_main)
255 |
--------------------------------------------------------------------------------
/Step18/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step19/README18.md:
--------------------------------------------------------------------------------
1 | ### Step 18: Move training into `TitanicModel`
2 |
3 | - Give `train()` the same interface as `process_inputs`
4 | - Process the data with `process_inputs` (just pass the arguments through)
5 | - Recreate the required targets from the passenger objects
6 | - Train the model and set the `trained` boolean to `True` (see the sketch below)
7 |
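8 | The whole step, excerpted from the file below:
9 |
10 | ```python
11 | def train(self, passengers):
12 |     targets = [v.is_survived for v in passengers]
13 |     inputs = self.process_inputs(passengers)  # first call: fits the transformers
14 |     self.predictor.fit(inputs, targets)
15 |     self.trained = True
16 | ```
17 |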
--------------------------------------------------------------------------------
/Step19/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step19/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step19/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self, passengers):
163 | targets = [v.is_survived for v in passengers]
164 | inputs = self.process_inputs(passengers)
165 | self.predictor.fit(inputs, targets)
166 | self.trained = True
167 |
168 | def estimate(self, passengers):
169 | return 1  # placeholder; implemented in the next step
170 |
171 |
172 | class TitanicModelCreator:
173 | def __init__(self, loader):
174 | self.loader = loader
175 | np.random.seed(42)
176 |
177 | def split_passengers(self, passengers):
178 | passengers_map = {p.pid: p for p in passengers}
179 | pids = [passenger.pid for passenger in passengers]
180 | targets = [passenger.is_survived for passenger in passengers]
181 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
182 | train_passengers = [passengers_map[pid] for pid in train_pids]
183 | test_passengers = [passengers_map[pid] for pid in test_pids]
184 | return train_passengers, test_passengers
185 |
186 | def run(self):
187 | passengers = self.loader.get_passengers()
188 | train_passengers, test_passengers = self.split_passengers(passengers)
189 |
190 | y_train = [v.is_survived for v in train_passengers]
191 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
192 | y_test = [v.is_survived for v in test_passengers]
193 |
194 | # --- TRAINING ---
195 | model = TitanicModel()
196 | model.train(train_passengers)
197 |
198 | X_train_processed = model.process_inputs(train_passengers)
199 | y_train_estimation = model.predictor.predict(X_train_processed)
200 | cm_train = confusion_matrix(y_train, y_train_estimation)
201 |
202 | # --- TESTING ---
203 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
204 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
205 |
206 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
207 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
208 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
209 | X_test_numerical_imputed
210 | )
211 |
212 | X_test_processed = np.hstack(
213 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
214 | )
215 |
216 | y_test_estimation = model.predictor.predict(X_test_processed)
217 | cm_test = confusion_matrix(y_test, y_test_estimation)
218 |
219 | print('cm_train', cm_train)
220 | print('cm_test', cm_test)
221 |
222 | do_test('../data/cm_test.pkl', cm_test)
223 | do_test('../data/cm_train.pkl', cm_train)
224 | do_test('../data/X_train_processed.pkl', X_train_processed)
225 | do_test('../data/X_test_processed.pkl', X_test_processed)
226 |
227 | do_pandas_test(
228 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
229 | )
230 |
231 |
232 | def main(param: str = 'pass'):
233 | titanic_model_creator = TitanicModelCreator(
234 | loader=PassengerLoader(
235 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
236 | rare_titles=RARE_TITLES,
237 | )
238 | )
239 | titanic_model_creator.run()
240 |
241 |
242 | def test_main(param: str = 'pass'):
243 | titanic_model_creator = TitanicModelCreator(
244 | loader=PassengerLoader(
245 | loader=TestLoader(
246 | passengers_filename='../data/passengers_with_is_survived.pkl',
247 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
248 | ),
249 | rare_titles=RARE_TITLES,
250 | )
251 | )
252 | titanic_model_creator.run()
253 |
254 |
255 | if __name__ == "__main__":
256 | typer.run(test_main)
257 |
--------------------------------------------------------------------------------
/Step19/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step20/README19.md:
--------------------------------------------------------------------------------
1 | ### Step 19: Move prediction to `TitanicModel`
2 |
3 | - Create the `estimate` function
4 | - Call `process_inputs` and `predictor.predict` in it
5 | - Remove all evaluation input processing code
6 | - Call `estimate` from `run`
7 |
8 | Because there was no separation of concerns, the input processing code was duplicated; now that it has been moved to its own location, the duplicates can be removed.
9 |
10 | `X_train_processed` and `X_test_processed` no longer exist, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and to find a different way to test behaviour. To keep the project short we set this aside, but it would be a good place to introduce more tests (see the sketch below).
11 |
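As a sketch of what such a behaviour-level test could look like, the helper below (hypothetical, not part of the repo) asserts properties of `estimate`'s output instead of comparing recreated intermediate arrays:

```python
# Hypothetical behaviour-level check: no pickled ground-truth arrays needed,
# we assert structural properties of the estimates directly.
def check_estimate_behaviour(model, passengers):
    estimations = model.estimate(passengers)
    assert len(estimations) == len(passengers)  # one estimate per passenger
    assert set(estimations).issubset({0, 1})  # binary survival labels
    print('estimate behaviour test passed')
```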
--------------------------------------------------------------------------------
/Step20/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step20/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step20/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self, passengers):
163 | targets = [v.is_survived for v in passengers]
164 | inputs = self.process_inputs(passengers)
165 | self.predictor.fit(inputs, targets)
166 | self.trained = True
167 |
168 | def estimate(self, passengers):
169 | inputs = self.process_inputs(passengers)
170 | return self.predictor.predict(inputs)
171 |
172 |
173 | class TitanicModelCreator:
174 | def __init__(self, loader):
175 | self.loader = loader
176 | np.random.seed(42)
177 |
178 | def split_passengers(self, passengers):
179 | passengers_map = {p.pid: p for p in passengers}
180 | pids = [passenger.pid for passenger in passengers]
181 | targets = [passenger.is_survived for passenger in passengers]
182 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
183 | train_passengers = [passengers_map[pid] for pid in train_pids]
184 | test_passengers = [passengers_map[pid] for pid in test_pids]
185 | return train_passengers, test_passengers
186 |
187 | def run(self):
188 | passengers = self.loader.get_passengers()
189 | train_passengers, test_passengers = self.split_passengers(passengers)
190 |
191 | # --- TRAINING ---
192 | model = TitanicModel()
193 | model.train(train_passengers)
194 | y_train_estimation = model.estimate(train_passengers)
195 | cm_train = confusion_matrix(
196 | [v.is_survived for v in train_passengers], y_train_estimation
197 | )
198 |
199 | # --- TESTING ---
200 | y_test_estimation = model.estimate(test_passengers)
201 | cm_test = confusion_matrix(
202 | [v.is_survived for v in test_passengers], y_test_estimation
203 | )
204 |
205 | print('cm_train', cm_train)
206 | print('cm_test', cm_test)
207 |
208 | do_test('../data/cm_test.pkl', cm_test)
209 | do_test('../data/cm_train.pkl', cm_train)
210 | X_train_processed = model.process_inputs(train_passengers)
211 | do_test('../data/X_train_processed.pkl', X_train_processed)
212 | X_test_processed = model.process_inputs(test_passengers)
213 | do_test('../data/X_test_processed.pkl', X_test_processed)
214 |
215 | do_pandas_test(
216 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
217 | )
218 |
219 |
220 | def main(param: str = 'pass'):
221 | titanic_model_creator = TitanicModelCreator(
222 | loader=PassengerLoader(
223 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
224 | rare_titles=RARE_TITLES,
225 | )
226 | )
227 | titanic_model_creator.run()
228 |
229 |
230 | def test_main(param: str = 'pass'):
231 | titanic_model_creator = TitanicModelCreator(
232 | loader=PassengerLoader(
233 | loader=TestLoader(
234 | passengers_filename='../data/passengers_with_is_survived.pkl',
235 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
236 | ),
237 | rare_titles=RARE_TITLES,
238 | )
239 | )
240 | titanic_model_creator.run()
241 |
242 |
243 | if __name__ == "__main__":
244 | typer.run(test_main)
245 |
--------------------------------------------------------------------------------
/Step20/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step21/README20.md:
--------------------------------------------------------------------------------
1 | ### Step 20: Save model and move tests to custom model savers
2 |
3 | - Create `ModelSaver` that has a `save_model` interface that accepts a model and a result object
4 | - Pickle the model and the result to a file
5 | - Create `TestModelSaver` that has the same interface
6 | - Move the testing code to the `save_model` function
7 | - Add a `model_saver` property to `TitanicModelCreator` and call it after the evaluation code
8 | - Pass an instance of `ModelSaver` in `main` and of `TestModelSaver` in `test_main` to the construction of `TitanicModelCreator`
9 |
10 | Currently `TitanicModelCreator` contains its own testing code, even though it is intended to run in production. It also has no way to save the model. We introduce the concept of a `ModelSaver` here: anything that needs to be preserved after model training needs to be passed to this class.
11 |
12 | We will also move testing into a dedicated `TestModelSaver` that, instead of saving the model, runs the tests that would otherwise live in `run()`. This way the same code can run in production and in testing without change.
13 |
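As a usage note, here is a minimal sketch (assuming the filenames passed to `ModelSaver` in `main()` below) of loading the pickled model back for later inference:

```python
import pickle

# Load the model and result saved by ModelSaver (paths as used in main()).
with open('../data/real_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('../data/real_result.pkl', 'rb') as f:
    result = pickle.load(f)

# The restored TitanicModel can estimate passengers directly.
predictions = model.estimate(result['test_passengers'])
```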
--------------------------------------------------------------------------------
/Step21/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step21/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step21/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class ModelSaver:
111 | def __init__(self, model_filename, result_filename):
112 | self.model_filename = model_filename
113 | self.result_filename = result_filename
114 |
115 | def save_model(self, model, result):
116 | pickle.dump(model, open(self.model_filename, 'wb'))
117 | pickle.dump(result, open(self.result_filename, 'wb'))
118 |
119 |
120 | class TestModelSaver:
121 | def __init__(self):
122 | pass
123 |
124 | def save_model(self, model, result):
125 | do_test('../data/cm_test.pkl', result['cm_test'])
126 | do_test('../data/cm_train.pkl', result['cm_train'])
127 | X_train_processed = model.process_inputs(result['train_passengers'])
128 | do_test('../data/X_train_processed.pkl', X_train_processed)
129 | X_test_processed = model.process_inputs(result['test_passengers'])
130 | do_test('../data/X_test_processed.pkl', X_test_processed)
131 |
132 |
133 | class PassengerLoader:
134 | def __init__(self, loader, rare_titles=None):
135 | self.loader = loader
136 | self.rare_titles = rare_titles
137 |
138 | def get_passengers(self):
139 | passengers = []
140 | for data in self.loader.get_passengers().itertuples():
141 | # parch = Parents/Children, sibsp = Siblings/Spouses
142 | family_size = int(data.parch + data.sibsp)
143 | # Allen, Miss. Elisabeth Walton
144 | title = data.name.split(',')[1].split('.')[0].strip()
145 | passenger = Passenger(
146 | pid=int(data.pid),
147 | pclass=int(data.pclass),
148 | sex=str(data.sex),
149 | age=float(data.age),
150 | family_size=family_size,
151 | fare=float(data.fare),
152 | embarked=str(data.embarked),
153 | is_alone=1 if family_size == 1 else 0,
154 | title='rare' if title in self.rare_titles else title,
155 | is_survived=int(data.is_survived),
156 | )
157 | passengers.append(passenger)
158 | return passengers
159 |
160 |
161 | class TitanicModel:
162 | def __init__(self):
163 | self.trained = False
164 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
165 | self.knn_imputer = KNNImputer(n_neighbors=5)
166 | self.robust_scaler = RobustScaler()
167 | self.predictor = LogisticRegression(random_state=0)
168 |
169 | def process_inputs(self, passengers):
170 | data = pd.DataFrame([v.dict() for v in passengers])
171 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
172 | numerical_data = data[['age', 'fare', 'family_size']]
173 | if self.trained:
174 | categorical_data = self.one_hot_encoder.transform(categorical_data)
175 | numerical_data = self.robust_scaler.transform(
176 | self.knn_imputer.transform(numerical_data)
177 | )
178 | else:
179 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
180 | numerical_data = self.robust_scaler.fit_transform(
181 | self.knn_imputer.fit_transform(numerical_data)
182 | )
183 | return np.hstack((categorical_data, numerical_data))
184 |
185 | def train(self, passengers):
186 | targets = [v.is_survived for v in passengers]
187 | inputs = self.process_inputs(passengers)
188 | self.predictor.fit(inputs, targets)
189 | self.trained = True
190 |
191 | def estimate(self, passengers):
192 | inputs = self.process_inputs(passengers)
193 | return self.predictor.predict(inputs)
194 |
195 |
196 | class TitanicModelCreator:
197 | def __init__(self, loader, model_saver):
198 | self.loader = loader
199 | self.model_saver = model_saver
200 | np.random.seed(42)
201 |
202 | def split_passengers(self, passengers):
203 | passengers_map = {p.pid: p for p in passengers}
204 | pids = [passenger.pid for passenger in passengers]
205 | targets = [passenger.is_survived for passenger in passengers]
206 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
207 | train_passengers = [passengers_map[pid] for pid in train_pids]
208 | test_passengers = [passengers_map[pid] for pid in test_pids]
209 | return train_passengers, test_passengers
210 |
211 | def run(self):
212 | passengers = self.loader.get_passengers()
213 | train_passengers, test_passengers = self.split_passengers(passengers)
214 |
215 | # --- TRAINING ---
216 | model = TitanicModel()
217 | model.train(train_passengers)
218 | y_train_estimation = model.estimate(train_passengers)
219 | cm_train = confusion_matrix(
220 | [v.is_survived for v in train_passengers], y_train_estimation
221 | )
222 |
223 | # --- TESTING ---
224 | y_test_estimation = model.estimate(test_passengers)
225 | cm_test = confusion_matrix(
226 | [v.is_survived for v in test_passengers], y_test_estimation
227 | )
228 |
229 | self.model_saver.save_model(
230 | model=model,
231 | result={
232 | 'cm_train': cm_train,
233 | 'cm_test': cm_test,
234 | 'train_passengers': train_passengers,
235 | 'test_passengers': test_passengers,
236 | },
237 | )
238 |
239 |
240 | def main(param: str = 'pass'):
241 | titanic_model_creator = TitanicModelCreator(
242 | loader=PassengerLoader(
243 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
244 | rare_titles=RARE_TITLES,
245 | ),
246 | model_saver=ModelSaver(
247 | model_filename='../data/real_model.pkl',
248 | result_filename='../data/real_result.pkl',
249 | ),
250 | )
251 | titanic_model_creator.run()
252 |
253 |
254 | def test_main(param: str = 'pass'):
255 | titanic_model_creator = TitanicModelCreator(
256 | loader=PassengerLoader(
257 | loader=TestLoader(
258 | passengers_filename='../data/passengers_with_is_survived.pkl',
259 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
260 | ),
261 | rare_titles=RARE_TITLES,
262 | ),
263 | model_saver=TestModelSaver(),
264 | )
265 | titanic_model_creator.run()
266 |
267 |
268 | if __name__ == "__main__":
269 | typer.run(test_main)
270 |
--------------------------------------------------------------------------------
/Step21/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step22/README21.md:
--------------------------------------------------------------------------------
1 | ### Step 21: Enable training of different models
2 |
3 | - Add a `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance
4 | - Add `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main`
5 | - Expose parts of `TitanicModel` (the `predictor` and the `n_neighbors` processing parameter)
6 |
7 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment, with new loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows you to restructure your code on the fly and find out what the most convenient architecture for your system is. Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start implementation. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintaining code quality provide exactly this at very low effort.
8 |
9 | Next steps:
10 |
11 | - Use different data:
12 | - Update `SqlLoader` to retrieve different data
13 | - Update `Passenger` class to contain this new data
14 | - Update `PassengerLoader` class to process this new data into the classes
15 | - Update `process_inputs` to create features out of this new data
16 | - Use different features:
17 | - Update `process_inputs` in `TitanicModel`, expose parameters as needed
18 | - Use different model:
19 | - Use a different `predictor` in `TitanicModel` (see the sketch below)
20 |
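For example, a minimal sketch (assuming `TitanicModel` is in scope) of swapping in a different predictor; any estimator with `fit()`/`predict()` methods works:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical experiment context: same pipeline, different predictor.
model = TitanicModel(
    n_neighbors=5,
    predictor=RandomForestClassifier(n_estimators=100, random_state=0),
)
```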
--------------------------------------------------------------------------------
/Step22/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step22/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step22/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class ModelSaver:
111 | def __init__(self, model_filename, result_filename):
112 | self.model_filename = model_filename
113 | self.result_filename = result_filename
114 |
115 | def save_model(self, model, result):
116 | pickle.dump(model, open(self.model_filename, 'wb'))
117 | pickle.dump(result, open(self.result_filename, 'wb'))
118 |
119 |
120 | class TestModelSaver:
121 | def __init__(self):
122 | pass
123 |
124 | def save_model(self, model, result):
125 | do_test('../data/cm_test.pkl', result['cm_test'])
126 | do_test('../data/cm_train.pkl', result['cm_train'])
127 | X_train_processed = model.process_inputs(result['train_passengers'])
128 | do_test('../data/X_train_processed.pkl', X_train_processed)
129 | X_test_processed = model.process_inputs(result['test_passengers'])
130 | do_test('../data/X_test_processed.pkl', X_test_processed)
131 |
132 |
133 | class PassengerLoader:
134 | def __init__(self, loader, rare_titles=None):
135 | self.loader = loader
136 | self.rare_titles = rare_titles
137 |
138 | def get_passengers(self):
139 | passengers = []
140 | for data in self.loader.get_passengers().itertuples():
141 | # parch = Parents/Children, sibsp = Siblings/Spouses
142 | family_size = int(data.parch + data.sibsp)
143 | # Allen, Miss. Elisabeth Walton
144 | title = data.name.split(',')[1].split('.')[0].strip()
145 | passenger = Passenger(
146 | pid=int(data.pid),
147 | pclass=int(data.pclass),
148 | sex=str(data.sex),
149 | age=float(data.age),
150 | family_size=family_size,
151 | fare=float(data.fare),
152 | embarked=str(data.embarked),
153 | is_alone=1 if family_size == 1 else 0,
154 | title='rare' if title in self.rare_titles else title,
155 | is_survived=int(data.is_survived),
156 | )
157 | passengers.append(passenger)
158 | return passengers
159 |
160 |
161 | class TitanicModel:
162 | def __init__(self, n_neighbors=5, predictor=None):
163 | if predictor is None:
164 | predictor = LogisticRegression(random_state=0)
165 | self.trained = False
166 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
167 | self.knn_imputer = KNNImputer(n_neighbors=n_neighbors)
168 | self.robust_scaler = RobustScaler()
169 | self.predictor = predictor
170 |
171 | def process_inputs(self, passengers):
172 | data = pd.DataFrame([v.dict() for v in passengers])
173 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
174 | numerical_data = data[['age', 'fare', 'family_size']]
175 | if self.trained:
176 | categorical_data = self.one_hot_encoder.transform(categorical_data)
177 | numerical_data = self.robust_scaler.transform(
178 | self.knn_imputer.transform(numerical_data)
179 | )
180 | else:
181 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
182 | numerical_data = self.robust_scaler.fit_transform(
183 | self.knn_imputer.fit_transform(numerical_data)
184 | )
185 | return np.hstack((categorical_data, numerical_data))
186 |
187 | def train(self, passengers):
188 | targets = [v.is_survived for v in passengers]
189 | inputs = self.process_inputs(passengers)
190 | self.predictor.fit(inputs, targets)
191 | self.trained = True
192 |
193 | def estimate(self, passengers):
194 | inputs = self.process_inputs(passengers)
195 | return self.predictor.predict(inputs)
196 |
197 |
198 | class TitanicModelCreator:
199 | def __init__(self, loader, model, model_saver):
200 | self.loader = loader
201 | self.model = model
202 | self.model_saver = model_saver
203 | np.random.seed(42)
204 |
205 | def split_passengers(self, passengers):
206 | passengers_map = {p.pid: p for p in passengers}
207 | pids = [passenger.pid for passenger in passengers]
208 | targets = [passenger.is_survived for passenger in passengers]
209 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
210 | train_passengers = [passengers_map[pid] for pid in train_pids]
211 | test_passengers = [passengers_map[pid] for pid in test_pids]
212 | return train_passengers, test_passengers
213 |
214 | def run(self):
215 | passengers = self.loader.get_passengers()
216 | train_passengers, test_passengers = self.split_passengers(passengers)
217 |
218 | # --- TRAINING ---
219 | self.model.train(train_passengers)
220 | y_train_estimation = self.model.estimate(train_passengers)
221 | cm_train = confusion_matrix(
222 | [v.is_survived for v in train_passengers], y_train_estimation
223 | )
224 |
225 | # --- TESTING ---
226 | y_test_estimation = self.model.estimate(test_passengers)
227 | cm_test = confusion_matrix(
228 | [v.is_survived for v in test_passengers], y_test_estimation
229 | )
230 |
231 | self.model_saver.save_model(
232 | model=self.model,
233 | result={
234 | 'cm_train': cm_train,
235 | 'cm_test': cm_test,
236 | 'train_passengers': train_passengers,
237 | 'test_passengers': test_passengers,
238 | },
239 | )
240 |
241 |
242 | def main(param: str = 'pass'):
243 | titanic_model_creator = TitanicModelCreator(
244 | loader=PassengerLoader(
245 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
246 | rare_titles=RARE_TITLES,
247 | ),
248 | model=TitanicModel(),
249 | model_saver=ModelSaver(
250 | model_filename='../data/real_model.pkl',
251 | result_filename='../data/real_result.pkl',
252 | ),
253 | )
254 | titanic_model_creator.run()
255 |
256 |
257 | def test_main(param: str = 'pass'):
258 | titanic_model_creator = TitanicModelCreator(
259 | loader=PassengerLoader(
260 | loader=TestLoader(
261 | passengers_filename='../data/passengers_with_is_survived.pkl',
262 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
263 | ),
264 | rare_titles=RARE_TITLES,
265 | ),
266 | model=TitanicModel(n_neighbors=5, predictor=LogisticRegression(random_state=0)),
267 | model_saver=TestModelSaver(),
268 | )
269 | titanic_model_creator.run()
270 |
271 |
272 | if __name__ == "__main__":
273 | typer.run(test_main)
274 |
--------------------------------------------------------------------------------
/Step22/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/create_branch.sh:
--------------------------------------------------------------------------------
1 |
2 | switch=$1
3 |
4 | if [ -z "$switch" ]; then
5 | echo "use ./create_branch.sh -do"
6 | else
7 | cp Step02/* Step01
8 | cp Step03/* Step02
9 | cp Step04/* Step03
10 | cp Step05/* Step04
11 | cp Step06/* Step05
12 | cp Step07/* Step06
13 | cp Step08/* Step07
14 | cp Step09/* Step08
15 | cp Step10/* Step09
16 | cp Step11/* Step10
17 | cp Step12/* Step11
18 | cp Step13/* Step12
19 | cp Step14/* Step13
20 | cp Step15/* Step14
21 | cp Step16/* Step15
22 | cp Step17/* Step16
23 | cp Step18/* Step17
24 | cp Step19/* Step18
25 | cp Step20/* Step19
26 | cp Step21/* Step20
27 | cp Step22/* Step21
28 | fi
29 |
--------------------------------------------------------------------------------
/create_instructions.sh:
--------------------------------------------------------------------------------
1 | find . -name "README??.md" | sort | xargs cat > INSTRUCTIONS.md
2 |
--------------------------------------------------------------------------------
/create_sqlite.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import typer
4 | from sklearn.datasets import fetch_openml
5 | from sqlalchemy import create_engine
6 |
7 |
8 | def main():
9 | print('loading data')
10 | df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
11 | targets = pd.DataFrame(np.array([int(v) for v in targets]), columns=['is_survived'])
12 |
13 | print('creating db')
14 | engine = create_engine('sqlite:///titanic.db', echo=True)
15 | sqlite_connection = engine.connect()
16 |
17 | print('saving passengers')
18 | df.to_sql('tbl_passengers', sqlite_connection, index_label='pid')
19 | print('saving targets')
20 | targets.to_sql('tbl_targets', sqlite_connection, index_label='pid')
21 |
22 | print('closing db')
23 | sqlite_connection.close()
24 |
25 | print('done')
26 |
27 |
28 | if __name__ == "__main__":
29 | typer.run(main)
30 |
--------------------------------------------------------------------------------
/data/titanic.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/data/titanic.db
--------------------------------------------------------------------------------
/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 | black
9 |
--------------------------------------------------------------------------------
/test_all.sh:
--------------------------------------------------------------------------------
1 | rm data/*.pkl
2 | echo "Step05" && cd Step05 && ./titanic_model.sh && cd ..
3 | echo "Step06" && cd Step06 && ./titanic_model.sh && cd ..
4 | echo "Step07" && cd Step07 && ./titanic_model.sh && cd ..
5 | echo "Step08" && cd Step08 && ./titanic_model.sh && cd ..
6 | echo "Step09" && cd Step09 && ./titanic_model.sh && cd ..
7 | echo "Step10" && cd Step10 && ./titanic_model.sh && cd ..
8 | echo "Step11" && cd Step11 && ./titanic_model.sh && cd ..
9 | echo "Step12" && cd Step12 && ./titanic_model.sh && cd ..
10 | echo "Step13" && cd Step13 && ./titanic_model.sh && cd ..
11 | echo "Step14" && cd Step14 && ./titanic_model.sh && cd ..
12 | echo "Step15" && cd Step15 && ./titanic_model.sh && cd ..
13 | echo "Step16" && cd Step16 && ./titanic_model.sh && cd ..
14 | echo "Step17" && cd Step17 && ./titanic_model.sh && cd ..
15 | echo "Step18" && cd Step18 && ./titanic_model.sh && cd ..
16 | echo "Step19" && cd Step19 && ./titanic_model.sh && cd ..
17 | echo "Step20" && cd Step20 && ./titanic_model.sh && cd ..
18 | echo "Step21" && cd Step21 && ./titanic_model.sh && cd ..
19 | echo "Step22" && cd Step22 && ./titanic_model.sh && cd ..
20 |
--------------------------------------------------------------------------------