├── .gitignore
├── INSTRUCTIONS.md
├── README.md
├── Step00
│   ├── README.md
│   └── Slide0_Notebook.ipynb
├── Step01
│   ├── .gitkeep
│   └── README00.md
├── Step02
│   ├── README01.md
│   ├── make_venv.sh
│   └── requirements.txt
├── Step03
│   ├── README02.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step04
│   ├── README03.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step05
│   ├── README04.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step06
│   ├── README05.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step07
│   ├── README06.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step08
│   ├── README07.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step09
│   ├── README08.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step10
│   ├── README09.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step11
│   ├── README10.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step12
│   ├── README11.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step13
│   ├── README12.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step14
│   ├── README13.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step15
│   ├── README14.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step16
│   ├── README15.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step17
│   ├── README16.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step18
│   ├── README17.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step19
│   ├── README18.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step20
│   ├── README19.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step21
│   ├── README20.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step22
│   ├── README21.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── create_branch.sh
├── create_instructions.sh
├── create_sqlite.py
├── data
│   └── titanic.db
├── make_venv.sh
├── requirements.txt
└── test_all.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | .vscode/
3 | .idea/
4 | .ipynb_checkpoints/
5 | __pycache__
6 | __pycache__/*
7 | data/*.pkl
8 |
--------------------------------------------------------------------------------
/INSTRUCTIONS.md:
--------------------------------------------------------------------------------
1 | ### Step 01: Project setup
2 |
3 | - Write script to create virtual environment
4 | - Write the first `requirements.txt`
5 |
6 | You can select a different `setuptools` version or pin the package versions.
7 | ### Step 02: Code setup
8 |
9 | - Write python script stub with `typer`
10 | - Write shell script to execute the python script
11 |
12 | `Typer` is an amazing tool that turns any python function into a command line interface. Here we use it for future-proofing because at the moment there are no CLI arguments.
13 |
14 | The program will be defined in a class that is instantiated by the `main()` function, which then calls its main `run()` entry point. The `main()` function will be called by `typer` to pass in any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`). More on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london).
15 | ### Step 03: Move code out of the notebook
16 |
17 | - Copy-paste everything into the `run()` function
18 |
19 | First step is to get started. There will be plenty of steps to structure the code better.
20 | ### Step 04: Move over the tests
21 |
22 | - Copy-paste tests and testing code from the notebook in `Step00` into the `run()` function.
23 |
24 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the value of some variables, and the next time you run the code it will compare them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
25 | ### Step 05: Decouple from the database
26 |
27 | - Write `SQLLoader` class
28 | - Move database related code into it
29 | - Replace database calls with interface calls in `run()`
30 |
31 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
32 | ### Step 06: Decouple from the database
33 |
34 | - Create loader property and argument in `TitanicModelCreator.__init__()`
35 | - Remove the database loader instantiation from the `run()` function
36 | - Update `TitanicModelCreator` construction to create the loader there
37 |
38 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing to build a test context for rapid iteration. After you created the adapter class, this step does the actual decoupling. This is an example of "Dependency Injection", when a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction. As the saying goes: "Complex behaviour is constructed not written." The related Dependency Inversion Principle is the `D` in the famed `SOLID` principles, and arguably the most important.
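
A minimal sketch of the change, assuming the `SqlLoader` from the previous step (the argument name is a free choice):

```python
class TitanicModelCreator:
    def __init__(self, loader):
        self.loader = loader  # injected: anything with get_passengers()/get_targets()
        np.random.seed(42)

    def run(self):
        df = self.loader.get_passengers()
        targets = self.loader.get_targets()
        # ... rest of the pipeline unchanged ...


def main(param: str = 'pass'):
    # Construction happens here, not inside run()
    titanic_model_creator = TitanicModelCreator(
        loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
    )
    titanic_model_creator.run()
```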
39 | ### Step 07: Write testing dataloader
40 |
41 | - Write a class that loads the required data from files
42 | - Same interface as `SqlLoader`
43 | - Add a "real" loader to it as a property
44 |
45 | This will allow the test context to work without a DB connection and still have the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable as they have the same interface.
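
A possible sketch, assuming pickled dataframes as the cache format (the file locations are an assumption):

```python
import os
import pandas as pd


class TestLoader:
    def __init__(self, passengers_filename, targets_filename, real_loader):
        self.passengers_filename = passengers_filename
        self.targets_filename = targets_filename
        self.real_loader = real_loader  # the "real" SqlLoader as fallback

    def get_passengers(self):
        # First run: fetch through the real loader and cache to file
        if not os.path.isfile(self.passengers_filename):
            self.real_loader.get_passengers().to_pickle(self.passengers_filename)
        return pd.read_pickle(self.passengers_filename)

    def get_targets(self):
        if not os.path.isfile(self.targets_filename):
            self.real_loader.get_targets().to_pickle(self.targets_filename)
        return pd.read_pickle(self.targets_filename)
```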
46 | ### Step 08: Write the test context
47 |
48 | - Create `test_main` function
49 | - Make sure `typer` calls that in `typer.run()`
50 | - Copy the code from `main` to `test_main`
51 | - Replace `SqlLoader` in it with `TestLoader`
52 |
53 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. Also, if it is still not fast enough, an additional parameter can reduce the amount of data in the test to make the process faster. [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo]
54 |
55 | This is the essence of the importance of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places, there is no time spent on translating from one to another. The test setup should reflect the production context as closely as possible, so when a test fails or passes you can expect the same to happen in production as well. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" when you are convinced it is doing what you think it should do. But it is the same code, so deployment is effortless.
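
A sketch of the test context (the cache filenames are assumptions):

```python
def test_main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=TestLoader(
            passengers_filename='../data/passengers.pkl',
            targets_filename='../data/targets.pkl',
            real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
        )
    )
    titanic_model_creator.run()


if __name__ == "__main__":
    typer.run(test_main)  # test_main is exercised from now on; main stays for "production"
```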
56 | ### Step 09: Merge passenger data with targets
57 |
58 | - Remove the `get_targets()` interface
59 | - Replace the query in `SqlLoader`
60 | - Remove any code related to `targets`
61 |
62 | This is a step to prepare to build the "domain data model". The Titanic model is about the survival of her passengers. For the code to align with this domain, the concept of "passengers" needs to be introduced (as a class/object). A passenger either survived or not; it is an attribute of the passenger and needs to be implemented like that. The replacement query is sketched after the links below.
63 |
64 | This is a critical part of the code quality journey and building better systems. Once you introduce these concepts your code will depend directly on the business problem you are solving, not the various representations in which the data is stored (pandas, numpy, csv, etc.). I wrote about this many times on my blog:
65 |
66 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data)
67 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
68 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses)
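
The replacement query in `SqlLoader` could look like this (table and column names as shown in the notebook's schema listing):

```python
class SqlLoader:
    # ... __init__ as before ...

    def get_passengers(self):
        # Survival becomes an attribute of the passenger row; get_targets() is gone
        query = '''
            SELECT tbl_passengers.*, tbl_targets.is_survived
            FROM tbl_passengers
            JOIN tbl_targets ON tbl_passengers.pid = tbl_targets.pid
        '''
        return pd.read_sql(query, con=self.connection)
```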
69 | ### Step 10: Create Passenger class
70 |
71 | - Import `BaseModel` from `pydantic`
72 | - Create the class by inspecting:
73 | - The `dtype` of columns used in `df`
74 | - The actual values in `df`
75 | - The names of the columns that are used later in the code
76 |
77 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you need to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic`'s validators. One goal of the refactoring is to make sure that most interaction between classes happens through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen.
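
A plausible version based on the columns used in the notebook; the exact types (and which fields need `Optional` because of missing values) are assumptions to verify against the data:

```python
from typing import Optional
from pydantic import BaseModel


class Passenger(BaseModel):
    pid: int
    pclass: int
    sex: str
    age: Optional[float] = None   # has missing values, imputed later
    fare: Optional[float] = None
    embarked: str
    family_size: int              # derived: parch + sibsp
    is_alone: int                 # derived from family_size
    title: str                    # parsed from the name, rare titles collapsed
    is_survived: int
```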
78 | ### Step 11: Create domain data object based data loader
79 |
80 | - Create `PassengerLoader` class that takes a "real"/"old" loader
81 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects
82 | - Copy the data transformations from `TitanicModelCreator.run()`
83 |
84 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear fewer than 10 times are selected. This can be done only if you have access to the entire database, and this list needs to be maintained. This can cause problems in a real setting when the above operation is too difficult to do, for example if you have millions of items or a constant stream. These kinds of dependencies are common in legacy code, and one of the goals of refactoring is to identify them and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service.
85 |
86 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other classes; they are a type of adapter that hides away where the data is coming from and how it is stored, and returns only abstract, domain-relevant classes that you can use downstream. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows (a sketch of the factory follows the links):
87 |
88 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to)
89 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
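
A sketch of the factory, with the transformations copied over from `run()` (field names follow the `Passenger` sketch above):

```python
class PassengerLoader:
    def __init__(self, loader, rare_titles=None):
        self.loader = loader                    # the "real"/"old" loader
        self.rare_titles = rare_titles or set()

    def get_passengers(self):
        passengers = []
        for _, row in self.loader.get_passengers().iterrows():
            # parch = Parents/Children, sibsp = Siblings/Spouses
            family_size = row['parch'] + row['sibsp']
            title = row['name'].split(',')[1].split('.')[0].strip()
            passengers.append(Passenger(
                pid=row['pid'],
                pclass=row['pclass'],
                sex=row['sex'],
                age=row['age'],
                fare=row['fare'],
                embarked=row['embarked'],
                family_size=family_size,
                is_alone=1 if family_size == 1 else 0,
                title='rare' if title in self.rare_titles else title,
                is_survived=row['is_survived'],
            ))
        return passengers
```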
90 | ### Step 12: Remove any data that is not explicitly needed
91 |
92 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input
93 |
94 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed. If the requirements change, they can be added back again. For example, the `ticket` column is in `df` but it is never used again in the program. Remove it.
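
Continuing the Step 09 sketch, the narrowed query could be (the column list is an assumption based on what the factory consumes):

```python
query = '''
    SELECT tbl_passengers.pid, pclass, sex, age, parch, sibsp, fare, embarked, name,
           tbl_targets.is_survived
    FROM tbl_passengers
    JOIN tbl_targets ON tbl_passengers.pid = tbl_targets.pid
'''
```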
95 | ### Step 13: Use Passenger objects in the program
96 |
97 | - Add `PassengerLoader` to `main` and `test_main`
98 | - Add the `RARE_TITLES` constant
99 | - Convert the classes back into the `df` dataframe with `passenger.dict()`
100 |
101 | It is very important to do refactoring incrementally. Any change should be small enough that if the tests fail the source can be found quickly. So for now we stop at using the new loader but do not change anything else.
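
A sketch of the updated construction and the temporary conversion back to the dataframe (the `RARE_TITLES` values can be read off the notebook's `rare_titles` output):

```python
RARE_TITLES = {'Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
               'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir', 'the Countess'}


def main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            rare_titles=RARE_TITLES,
        )
    )
    titanic_model_creator.run()
```

Inside `run()`, for now, the passengers are converted straight back: `df = pd.DataFrame([p.dict() for p in self.loader.get_passengers()])`.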
102 | ### Step 14: Separate training and evaluation functions
103 |
104 | - Move all code related to evaluation (variables that have `_test_` in their name) into one group
105 |
106 | After the model is created, first it is trained, then it is evaluated on the training data, then on the testing data. These should be separated from each other into their own logical places. This prepares for moving them into an actually separate place.
107 | ### Step 15: Create `TitanicModel` class
108 |
109 | - Create a class that has all the `sklearn` components as member variables
110 | - Instantiate these before the "Training" block
111 | - Use these instead of the local ones
112 |
113 | The goal of the whole program is to create a model; despite this, until now there was no single object describing this model. The next step is to establish the concept of this model and what kind of services it provides for `TitanicModelCreator`.
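
A minimal sketch, with the parameter values copied from the current `run()`:

```python
class TitanicModel:
    def __init__(self):
        # All sklearn components of the model now live in one place
        self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        self.knn_imputer = KNNImputer(n_neighbors=5)
        self.robust_scaler = RobustScaler()
        self.predictor = LogisticRegression(random_state=0)
```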
114 | ### Step 16: Passenger class based training and evaluation sets
115 |
116 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely if the passenger survived or not)
117 | - Refactor `X_train/X_test` to be created from these lists of passengers
118 |
119 | Because `train_test_split` works on lists, we extract the pids and the targets from the classes and create the two sets from a mapping from pids to passenger classes.
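
One way to sketch that function (assuming the `Passenger` fields from Step 10):

```python
class TitanicModelCreator:
    # ... __init__ and run() as before ...

    def get_train_test_passengers(self, passengers):
        pids = [passenger.pid for passenger in passengers]
        targets = [passenger.is_survived for passenger in passengers]
        pid_to_passenger = {passenger.pid: passenger for passenger in passengers}
        train_pids, test_pids, _, _ = train_test_split(
            pids, targets, stratify=targets, test_size=0.2
        )
        train_passengers = [pid_to_passenger[pid] for pid in train_pids]
        test_passengers = [pid_to_passenger[pid] for pid in test_pids]
        return train_passengers, test_passengers
```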
120 | ### Step 17: Create input processing for `TitanicModel`
121 |
122 | - Move code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`.
123 | - Introduce `self.trained` boolean
124 | - Based on `self.trained` either call the `transform` or `fit_transform` of the `sklearn` input processor functions
125 |
126 | All the input transformation code happens twice: once for the training data and once for the evaluation data, even though transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies the functionality from `TitanicModel`. There will be several steps to resolve this. The resulting code will create a self-contained model that can be shipped independently of its creator.
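
A sketch of the shape of `process_inputs` (the dataframe reconstruction mirrors the current code; treat the details as assumptions):

```python
class TitanicModel:
    # ... __init__ from Step 15, plus self.trained = False ...

    def process_inputs(self, passengers):
        df = pd.DataFrame([passenger.dict() for passenger in passengers])
        categorical = df[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
        numerical = df[['age', 'fare', 'family_size']]
        if self.trained:
            # evaluation: reuse the already-fitted components
            categorical_one_hot = self.one_hot_encoder.transform(categorical)
            numerical_scaled = self.robust_scaler.transform(
                self.knn_imputer.transform(numerical)
            )
        else:
            # training: fit and transform in one go
            categorical_one_hot = self.one_hot_encoder.fit_transform(categorical)
            numerical_scaled = self.robust_scaler.fit_transform(
                self.knn_imputer.fit_transform(numerical)
            )
        return np.hstack((categorical_one_hot, numerical_scaled))
```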
127 |
128 | ### Step 18: Move training into `TitanicModel`
129 |
130 | - Use the same interface as `process_inputs` with `train()`
131 | - Process the data with `process_inputs` (just pass through the arguments)
132 | - Recreate the required targets with the mapping
133 | - Train the model and set the `trained` boolean to `True` (a sketch follows this list)
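
A sketch following these steps (the targets are recreated directly from the passengers here):

```python
class TitanicModel:
    # ... components and process_inputs as before ...

    def train(self, passengers):
        X = self.process_inputs(passengers)  # fits the processors while trained is False
        y = [passenger.is_survived for passenger in passengers]
        self.predictor.fit(X, y)
        self.trained = True
```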
134 | ### Step 19: Move prediction to `TitanicModel`
135 |
136 | - Create the `estimate` function
137 | - Call `process_inputs` and `predictor.predict` in it
138 | - Remove all evaluation input processing code
139 | - Call `estimate` from `run`
140 |
141 | Because there was no separation of concerns, the input processing code was duplicated; now that we moved it to its own location, the duplicate can be removed.
142 |
143 | `X_train_processed` and `X_test_processed` do not exist anymore, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and find a different way to test behaviour. To keep the project short we set this aside, but this would be a good place to introduce more tests.
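
The corresponding sketch:

```python
class TitanicModel:
    # ... train() and process_inputs() as before ...

    def estimate(self, passengers):
        X = self.process_inputs(passengers)  # transform only, since trained is True
        return self.predictor.predict(X)
```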
144 | ### Step 20: Save model and move tests to custom model savers
145 |
146 | - Create `ModelSaver` that has a `save_model` interface that accepts a model and a result object
147 | - Pickle the model and the result to a file
148 | - Create `TestModelSaver` that has the same interface
149 | - Move the testing code to the `save_model` function
150 | - Add `model_saver` property to `TitanicModelCreator` and call it after the evaluation code
151 | - Add an instance of `ModelSaver` and `TestModelSaver` respectively in `main` and `test_main` to the construction of `TitanicModelCreator`
152 |
153 | Currently `TitanicModelCreator` contains its own testing, while it is intended to run in production. It also has no way to save the model. We will introduce the concept of `ModelSaver` here; anything that needs to be preserved after the model training needs to be passed to this class.
154 |
155 | We will also move testing into a specific `TestModelSaver` that, instead of saving the model, will run the tests that would otherwise be in `run()`. This way the same code can run in production and in testing without change.
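
A sketch of the two savers; the shape of the `result` object (a dict here) and the filenames are assumptions:

```python
import pickle


class ModelSaver:
    def __init__(self, model_filename):
        self.model_filename = model_filename

    def save_model(self, model, result):
        with open(self.model_filename, 'wb') as f:
            pickle.dump({'model': model, 'result': result}, f)


class TestModelSaver:
    def save_model(self, model, result):
        # Instead of persisting anything, run the reference-value tests here
        do_test('../data/cm_test.pkl', result['cm_test'])
        do_test('../data/cm_train.pkl', result['cm_train'])
```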
156 | ### Step 21: Enable training of different models
157 |
158 | - Add a `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance.
159 | - Add `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main`
160 | - Expose parts of `TitanicModel` (predictor, processing parameter)
161 |
162 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment that uses new Loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows you to restructure your code on the fly and find out what the most convenient architecture for your system is. Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start the implementation. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintaining code quality provide exactly this at very low effort. A sketch of the resulting construction follows the next-steps list below.
163 |
164 | Next steps:
165 |
166 | - Use different data:
167 | - Update `SqlLoader` to retrieve different data
168 | - Update `Passenger` class to contain this new data
169 | - Update `PassengerLoader` class to process this new data into the classes
170 | - Update `process_inputs` to create features out of this new data
171 | - Use different features
172 | - Update `process_inputs` in `TitanicModel`, expose parameters as needed
173 | - Use different model:
174 | - Use different `predictor` in `TitanicModel`
175 |
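A sketch of the final construction in `main` (with Step 20's `model_saver`; the model filename is an assumption):

```python
def main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            rare_titles=RARE_TITLES,
        ),
        model=TitanicModel(),  # swap in a different model class to experiment
        model_saver=ModelSaver(model_filename='../data/titanic_model.pkl'),
    )
    titanic_model_creator.run()
```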
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CQ4DS Notebook Sklearn Refactoring Exercise
2 |
3 | This step-by-step programme demonstrates how to refactor a Data Science project from notebooks to well-formed classes and scripts.
4 |
5 | ### The project:
6 |
7 | The notebook demonstrates a typical setup of a data science project:
8 |
9 | - Connects to a database (included in the repository as an SQLite file).
10 | - Gathers some data (the classic Titanic example).
11 | - Does feature engineering.
12 | - Fits a model to estimate survival (sklearn's LogisticRegression).
13 | - Evaluates the model.
14 |
15 | ### Context, vision
16 |
17 | I wrote a detailed post on the concepts, strategy and big picture thinking. I recommend reading it in parallel with the instructions and the steps in the pull request while you are doing the exercises:
18 |
19 | [https://laszlo.substack.com/p/refactoring-the-titanic](https://laszlo.substack.com/p/refactoring-the-titanic)
20 |
21 | ### Refactoring
22 |
23 | The programme demonstrates how to improve code quality, increase agility and prepare for unforeseen changes in a real-world project (see `INSTRUCTIONS.md` for reference reading). You will perform the following steps:
24 |
25 | - Create end-to-end functional testing
26 | - Create shell scripts, command line interfaces, virtual environments
27 | - Decouple from external sources (the Database)
28 | - Refactor with simple Design Patterns (Adapter/Factory/Strategy)
29 | - Improve readability
30 | - Reduce code duplication
31 |
32 | ### Howto:
33 |
34 | - Clone the repository.
35 | - Create a virtual environment with `make_venv.sh`.
36 | - Follow the instructions in `INSTRUCTIONS.md`.
37 | - Run the tests with `titanic_model.sh`.
38 | - Check the diffs of the pull request's steps to verify your progress.
39 |
40 | ### Community:
41 |
42 | For more information and help, join our interactive self-help Code Quality for Data Science (CQ4DS) community on discord: [https://discord.gg/8uUZNMCad2](https://discord.gg/8uUZNMCad2).
43 |
44 | Original project content from and inspired by: [https://jaketae.github.io/study/sklearn-pipeline/](https://jaketae.github.io/study/sklearn-pipeline/)
45 |
--------------------------------------------------------------------------------
/Step00/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step00/README.md
--------------------------------------------------------------------------------
/Step00/Slide0_Notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import os\n",
10 | "import pickle\n",
11 | "import numpy as np\n",
12 | "import pandas as pd\n",
13 | "from collections import Counter\n",
14 | "from sqlalchemy import create_engine\n",
15 | "\n",
16 | "from sklearn.model_selection import train_test_split\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "from sklearn.preprocessing import RobustScaler\n",
19 | "from sklearn.preprocessing import OneHotEncoder\n",
20 | "from sklearn.impute import KNNImputer\n",
21 | "from sklearn.metrics import confusion_matrix"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 2,
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "data": {
31 | "text/html": [
32 | "
\n",
33 | "\n",
46 | "
\n",
47 | " \n",
48 | " \n",
49 | " | \n",
50 | " type | \n",
51 | " name | \n",
52 | " tbl_name | \n",
53 | " rootpage | \n",
54 | " sql | \n",
55 | "
\n",
56 | " \n",
57 | " \n",
58 | " \n",
59 | " 0 | \n",
60 | " table | \n",
61 | " tbl_passengers | \n",
62 | " tbl_passengers | \n",
63 | " 2 | \n",
64 | " CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\... | \n",
65 | "
\n",
66 | " \n",
67 | " 1 | \n",
68 | " table | \n",
69 | " tbl_targets | \n",
70 | " tbl_targets | \n",
71 | " 35 | \n",
72 | " CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t... | \n",
73 | "
\n",
74 | " \n",
75 | "
\n",
76 | "
"
77 | ],
78 | "text/plain": [
79 | " type name tbl_name rootpage \\\n",
80 | "0 table tbl_passengers tbl_passengers 2 \n",
81 | "1 table tbl_targets tbl_targets 35 \n",
82 | "\n",
83 | " sql \n",
84 | "0 CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\... \n",
85 | "1 CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t... "
86 | ]
87 | },
88 | "execution_count": 2,
89 | "metadata": {},
90 | "output_type": "execute_result"
91 | }
92 | ],
93 | "source": [
94 | "engine = create_engine('sqlite:///../data/titanic.db')\n",
95 | "sqlite_connection = engine.connect()\n",
96 | "pd.read_sql('SELECT * FROM sqlite_schema WHERE type=\"table\"', con=sqlite_connection)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 3,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "np.random.seed(42)\n",
106 | "\n",
107 | "df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)\n",
108 | "\n",
109 | "targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)\n",
110 | "\n",
111 | "# df, targets = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n",
112 | "\n",
113 | "# parch = Parents/Children, sibsp = Siblings/Spouses\n",
114 | "df['family_size'] = df['parch'] + df['sibsp']\n",
115 | "df['is_alone'] = [1 if family_size==1 else 0 for family_size in df['family_size']]\n",
116 | "\n",
117 | "df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]\n",
118 | "rare_titles = {k for k,v in Counter(df['title']).items() if v < 10}\n",
119 | "df['title'] = ['rare' if title in rare_titles else title for title in df['title']]\n",
120 | "\n",
121 | "df = df[[\n",
122 | " 'pclass', 'sex', 'age', 'ticket', 'family_size',\n",
123 | " 'fare', 'embarked', 'is_alone', 'title'\n",
124 | "]]\n",
125 | "\n",
126 | "targets = [int(v) for v in targets['is_survived']]\n",
127 | "X_train, X_test, y_train, y_test = train_test_split(df, targets, stratify=targets, test_size=0.2)"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 4,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "data": {
137 | "text/html": [
138 | "\n",
139 | "\n",
152 | "
\n",
153 | " \n",
154 | " \n",
155 | " | \n",
156 | " pclass | \n",
157 | " sex | \n",
158 | " age | \n",
159 | " ticket | \n",
160 | " family_size | \n",
161 | " fare | \n",
162 | " embarked | \n",
163 | " is_alone | \n",
164 | " title | \n",
165 | "
\n",
166 | " \n",
167 | " \n",
168 | " \n",
169 | " 0 | \n",
170 | " 1.0 | \n",
171 | " female | \n",
172 | " 29.0000 | \n",
173 | " 24160 | \n",
174 | " 0.0 | \n",
175 | " 211.3375 | \n",
176 | " S | \n",
177 | " 0 | \n",
178 | " Miss | \n",
179 | "
\n",
180 | " \n",
181 | " 1 | \n",
182 | " 1.0 | \n",
183 | " male | \n",
184 | " 0.9167 | \n",
185 | " 113781 | \n",
186 | " 3.0 | \n",
187 | " 151.5500 | \n",
188 | " S | \n",
189 | " 0 | \n",
190 | " Master | \n",
191 | "
\n",
192 | " \n",
193 | " 2 | \n",
194 | " 1.0 | \n",
195 | " female | \n",
196 | " 2.0000 | \n",
197 | " 113781 | \n",
198 | " 3.0 | \n",
199 | " 151.5500 | \n",
200 | " S | \n",
201 | " 0 | \n",
202 | " Miss | \n",
203 | "
\n",
204 | " \n",
205 | "
\n",
206 | "
"
207 | ],
208 | "text/plain": [
209 | " pclass sex age ticket family_size fare embarked is_alone \\\n",
210 | "0 1.0 female 29.0000 24160 0.0 211.3375 S 0 \n",
211 | "1 1.0 male 0.9167 113781 3.0 151.5500 S 0 \n",
212 | "2 1.0 female 2.0000 113781 3.0 151.5500 S 0 \n",
213 | "\n",
214 | " title \n",
215 | "0 Miss \n",
216 | "1 Master \n",
217 | "2 Miss "
218 | ]
219 | },
220 | "execution_count": 4,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "df[:3]"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 5,
232 | "metadata": {},
233 | "outputs": [],
234 | "source": [
235 | "X_train_categorical = X_train[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n",
236 | "X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n",
237 | "\n",
238 | "one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(X_train_categorical)\n",
239 | "X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)\n",
240 | "X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)"
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 6,
246 | "metadata": {},
247 | "outputs": [],
248 | "source": [
249 | "X_train_numerical = X_train[['age', 'fare', 'family_size']]\n",
250 | "X_test_numerical = X_test[['age', 'fare', 'family_size']]\n",
251 | "knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)\n",
252 | "X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)\n",
253 | "X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 7,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "robust_scaler = RobustScaler().fit(X_train_numerical_imputed)\n",
263 | "X_train_numerical_imputed_scaled = robust_scaler.transform(X_train_numerical_imputed)\n",
264 | "X_test_numerical_imputed_scaled = robust_scaler.transform(X_test_numerical_imputed)"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 8,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "X_train_processed = np.hstack((X_train_categorical_one_hot, X_train_numerical_imputed_scaled))\n",
274 | "X_test_processed = np.hstack((X_test_categorical_one_hot, X_test_numerical_imputed_scaled))"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 9,
280 | "metadata": {},
281 | "outputs": [],
282 | "source": [
283 | "model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)\n",
284 | "y_train_estimation = model.predict(X_train_processed)\n",
285 | "y_test_estimation = model.predict(X_test_processed)"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 10,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "cm_train = confusion_matrix(y_train, y_train_estimation)"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 11,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "cm_test = confusion_matrix(y_test, y_test_estimation)"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": 12,
309 | "metadata": {},
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "array([[553, 94],\n",
315 | " [107, 293]])"
316 | ]
317 | },
318 | "execution_count": 12,
319 | "metadata": {},
320 | "output_type": "execute_result"
321 | }
322 | ],
323 | "source": [
324 | "cm_train"
325 | ]
326 | },
327 | {
328 | "cell_type": "code",
329 | "execution_count": 13,
330 | "metadata": {},
331 | "outputs": [
332 | {
333 | "data": {
334 | "text/plain": [
335 | "array([[142, 20],\n",
336 | " [ 22, 78]])"
337 | ]
338 | },
339 | "execution_count": 13,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "cm_test"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 14,
351 | "metadata": {},
352 | "outputs": [
353 | {
354 | "name": "stdout",
355 | "output_type": "stream",
356 | "text": [
357 | "../data/cm_test.pkl test passed\n",
358 | "../data/cm_train.pkl test passed\n",
359 | "../data/X_train_processed.pkl test passed\n",
360 | "../data/X_test_processed.pkl test passed\n"
361 | ]
362 | }
363 | ],
364 | "source": [
365 | "def do_test(filename, data):\n",
366 | " if not os.path.isfile(filename):\n",
367 | " pickle.dump(data, open(filename, 'wb'))\n",
368 | " truth = pickle.load(open(filename, 'rb'))\n",
369 | " try:\n",
370 | " np.testing.assert_almost_equal(data, truth)\n",
371 | " print(f'{filename} test passed')\n",
372 | " except AssertionError as ex:\n",
373 | " print(f'{filename} test failed {ex}')\n",
374 | " \n",
375 | "do_test('../data/cm_test.pkl', cm_test)\n",
376 | "do_test('../data/cm_train.pkl', cm_train)\n",
377 | "do_test('../data/X_train_processed.pkl', X_train_processed)\n",
378 | "do_test('../data/X_test_processed.pkl', X_test_processed)\n"
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 15,
384 | "metadata": {},
385 | "outputs": [
386 | {
387 | "name": "stdout",
388 | "output_type": "stream",
389 | "text": [
390 | "../data/df.pkl pandas test passed\n"
391 | ]
392 | }
393 | ],
394 | "source": [
395 | "def do_pandas_test(filename, data):\n",
396 | " if not os.path.isfile(filename):\n",
397 | " data.to_pickle(filename)\n",
398 | " truth = pd.read_pickle(filename)\n",
399 | " try:\n",
400 | " pd.testing.assert_frame_equal(data, truth)\n",
401 | " print(f'{filename} pandas test passed')\n",
402 | " except AssertionError as ex:\n",
403 | " print(f'{filename} pandas test failed {ex}')\n",
404 | " \n",
405 | "# df['title'] = ['asd' for v in df['title']]\n",
406 | "do_pandas_test('../data/df.pkl', df)"
407 | ]
408 | },
409 | {
410 | "cell_type": "code",
411 | "execution_count": 16,
412 | "metadata": {},
413 | "outputs": [
414 | {
415 | "data": {
416 | "text/plain": [
417 | "{'Capt',\n",
418 | " 'Col',\n",
419 | " 'Don',\n",
420 | " 'Dona',\n",
421 | " 'Dr',\n",
422 | " 'Jonkheer',\n",
423 | " 'Lady',\n",
424 | " 'Major',\n",
425 | " 'Mlle',\n",
426 | " 'Mme',\n",
427 | " 'Ms',\n",
428 | " 'Rev',\n",
429 | " 'Sir',\n",
430 | " 'the Countess'}"
431 | ]
432 | },
433 | "execution_count": 16,
434 | "metadata": {},
435 | "output_type": "execute_result"
436 | }
437 | ],
438 | "source": [
439 | "rare_titles"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": []
448 | }
449 | ],
450 | "metadata": {
451 | "kernelspec": {
452 | "display_name": "Python 3 (ipykernel)",
453 | "language": "python",
454 | "name": "python3"
455 | },
456 | "language_info": {
457 | "codemirror_mode": {
458 | "name": "ipython",
459 | "version": 3
460 | },
461 | "file_extension": ".py",
462 | "mimetype": "text/x-python",
463 | "name": "python",
464 | "nbconvert_exporter": "python",
465 | "pygments_lexer": "ipython3",
466 | "version": "3.8.10"
467 | },
468 | "vscode": {
469 | "interpreter": {
470 | "hash": "774712da715a3086605d6bf08e7144a3a7e717b0d5585da12e288357dd4c8f07"
471 | }
472 | }
473 | },
474 | "nbformat": 4,
475 | "nbformat_minor": 4
476 | }
477 |
--------------------------------------------------------------------------------
/Step01/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/.gitkeep
--------------------------------------------------------------------------------
/Step01/README00.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/README00.md
--------------------------------------------------------------------------------
/Step02/README01.md:
--------------------------------------------------------------------------------
1 | ### Step 01: Project setup
2 |
3 | - Write script to create virtual environment
4 | - Write the first `requirements.txt`
5 |
6 | You can select a different `setuptools` version or pin the package versions.
7 |
--------------------------------------------------------------------------------
/Step02/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step02/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step03/README02.md:
--------------------------------------------------------------------------------
1 | ### Step 02: Code setup
2 |
3 | - Write python script stub with `typer`
4 | - Write shell script to execute the python script
5 |
6 | `Typer` is an amazing tool that turns any python function into a command line interface. Here we use it for future-proofing because at the moment there are no CLI arguments.
7 |
8 | The program will be defined in a class that is instantiated by the `main()` function, which then calls its main `run()` entry point. The `main()` function will be called by `typer` to pass in any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`). More on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london).
9 |
--------------------------------------------------------------------------------
/Step03/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step03/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step03/titanic_model.py:
--------------------------------------------------------------------------------
1 | import typer
2 |
3 |
4 | class TitanicModelCreator:
5 | def __init__(self):
6 | pass
7 |
8 | def run(self):
9 | print('Hello World!')
10 |
11 |
12 | def main(param: str = 'pass'):
13 | titanic_model_creator = TitanicModelCreator()
14 | titanic_model_creator.run()
15 |
16 |
17 | if __name__ == "__main__":
18 | typer.run(main)
19 |
--------------------------------------------------------------------------------
/Step03/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step04/README03.md:
--------------------------------------------------------------------------------
1 | ### Step 03: Move code out of the notebook
2 |
3 | - Copy-paste everything into the `run()` function
4 |
5 | First step is to get started. There will be plenty of steps to structure the code better.
6 |
--------------------------------------------------------------------------------
/Step04/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step04/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step04/titanic_model.py:
--------------------------------------------------------------------------------
1 | import typer
2 | import numpy as np
3 | import pandas as pd
4 | from collections import Counter
5 | from sqlalchemy import create_engine
6 |
7 | from sklearn.model_selection import train_test_split
8 | from sklearn.linear_model import LogisticRegression
9 | from sklearn.preprocessing import RobustScaler
10 | from sklearn.preprocessing import OneHotEncoder
11 | from sklearn.impute import KNNImputer
12 | from sklearn.metrics import confusion_matrix
13 |
14 |
15 | class TitanicModelCreator:
16 | def __init__(self):
17 | pass
18 |
19 | def run(self):
20 | engine = create_engine('sqlite:///../data/titanic.db')
21 | sqlite_connection = engine.connect()
22 | pd.read_sql(
23 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection
24 | )
25 | np.random.seed(42)
26 |
27 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)
28 |
29 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)
30 |
31 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
32 |
33 | # parch = Parents/Children, sibsp = Siblings/Spouses
34 | df['family_size'] = df['parch'] + df['sibsp']
35 | df['is_alone'] = [
36 | 1 if family_size == 1 else 0 for family_size in df['family_size']
37 | ]
38 |
39 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
40 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
41 | df['title'] = [
42 | 'rare' if title in rare_titles else title for title in df['title']
43 | ]
44 |
45 | df = df[
46 | [
47 | 'pclass',
48 | 'sex',
49 | 'age',
50 | 'ticket',
51 | 'family_size',
52 | 'fare',
53 | 'embarked',
54 | 'is_alone',
55 | 'title',
56 | ]
57 | ]
58 |
59 | targets = [int(v) for v in targets['is_survived']]
60 | X_train, X_test, y_train, y_test = train_test_split(
61 | df, targets, stratify=targets, test_size=0.2
62 | )
63 |
64 | X_train_categorical = X_train[
65 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
66 | ]
67 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
68 |
69 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
70 | X_train_categorical
71 | )
72 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
73 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
74 |
75 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
76 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
77 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
78 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
79 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
80 |
81 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
82 | X_train_numerical_imputed_scaled = robust_scaler.transform(
83 | X_train_numerical_imputed
84 | )
85 | X_test_numerical_imputed_scaled = robust_scaler.transform(
86 | X_test_numerical_imputed
87 | )
88 |
89 | X_train_processed = np.hstack(
90 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
91 | )
92 | X_test_processed = np.hstack(
93 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
94 | )
95 |
96 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
97 | y_train_estimation = model.predict(X_train_processed)
98 | y_test_estimation = model.predict(X_test_processed)
99 |
100 | cm_train = confusion_matrix(y_train, y_train_estimation)
101 |
102 | cm_test = confusion_matrix(y_test, y_test_estimation)
103 |
104 | print('cm_train', cm_train)
105 | print('cm_test', cm_test)
106 |
107 |
108 | def main(param: str = 'pass'):
109 | titanic_model_creator = TitanicModelCreator()
110 | titanic_model_creator.run()
111 |
112 |
113 | if __name__ == "__main__":
114 | typer.run(main)
115 |
--------------------------------------------------------------------------------
/Step04/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step05/README04.md:
--------------------------------------------------------------------------------
1 | ### Step 04: Move over the tests
2 |
3 | - Copy-paste tests and testing code from the notebook in `Step00` into the `run()` function.
4 |
5 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the value of some variables, and the next time you run the code it will compare them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
6 |
--------------------------------------------------------------------------------
/Step05/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step05/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step05/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class TitanicModelCreator:
40 | def __init__(self):
41 | pass
42 |
43 | def run(self):
44 | engine = create_engine('sqlite:///../data/titanic.db')
45 | sqlite_connection = engine.connect()
46 | pd.read_sql(
47 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection
48 | )
49 | np.random.seed(42)
50 |
51 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)
52 |
53 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)
54 |
55 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
56 |
57 | # parch = Parents/Children, sibsp = Siblings/Spouses
58 | df['family_size'] = df['parch'] + df['sibsp']
59 | df['is_alone'] = [
60 | 1 if family_size == 1 else 0 for family_size in df['family_size']
61 | ]
62 |
63 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
64 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
65 | df['title'] = [
66 | 'rare' if title in rare_titles else title for title in df['title']
67 | ]
68 |
69 | df = df[
70 | [
71 | 'pclass',
72 | 'sex',
73 | 'age',
74 | 'ticket',
75 | 'family_size',
76 | 'fare',
77 | 'embarked',
78 | 'is_alone',
79 | 'title',
80 | ]
81 | ]
82 |
83 | targets = [int(v) for v in targets['is_survived']]
84 | X_train, X_test, y_train, y_test = train_test_split(
85 | df, targets, stratify=targets, test_size=0.2
86 | )
87 |
88 | X_train_categorical = X_train[
89 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
90 | ]
91 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
92 |
93 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
94 | X_train_categorical
95 | )
96 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
97 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
98 |
99 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
100 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
101 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
102 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
103 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
104 |
105 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
106 | X_train_numerical_imputed_scaled = robust_scaler.transform(
107 | X_train_numerical_imputed
108 | )
109 | X_test_numerical_imputed_scaled = robust_scaler.transform(
110 | X_test_numerical_imputed
111 | )
112 |
113 | X_train_processed = np.hstack(
114 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
115 | )
116 | X_test_processed = np.hstack(
117 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
118 | )
119 |
120 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
121 | y_train_estimation = model.predict(X_train_processed)
122 | y_test_estimation = model.predict(X_test_processed)
123 |
124 | cm_train = confusion_matrix(y_train, y_train_estimation)
125 |
126 | cm_test = confusion_matrix(y_test, y_test_estimation)
127 |
128 | print('cm_train', cm_train)
129 | print('cm_test', cm_test)
130 |
131 | do_test('../data/cm_test.pkl', cm_test)
132 | do_test('../data/cm_train.pkl', cm_train)
133 | do_test('../data/X_train_processed.pkl', X_train_processed)
134 | do_test('../data/X_test_processed.pkl', X_test_processed)
135 |
136 | do_pandas_test('../data/df.pkl', df)
137 |
138 |
139 | def main(param: str = 'pass'):
140 | titanic_model_creator = TitanicModelCreator()
141 | titanic_model_creator.run()
142 |
143 |
144 | if __name__ == "__main__":
145 | typer.run(main)
146 |
--------------------------------------------------------------------------------
/Step05/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step06/README05.md:
--------------------------------------------------------------------------------
1 | ### Step 05: Decouple from the database
2 |
3 | - Write `SQLLoader` class
4 | - Move database related code into it
5 | - Replace database calls with interface calls in `run()`
6 |
7 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
8 |
--------------------------------------------------------------------------------
/Step06/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step06/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step06/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TitanicModelCreator:
54 | def __init__(self):
55 | np.random.seed(42)
56 |
57 | def run(self):
58 | loader = SqlLoader(connection_string='sqlite:///../data/titanic.db')
59 |
60 | df = loader.get_passengers()
61 | targets = loader.get_targets()
62 |
63 | # parch = Parents/Children, sibsp = Siblings/Spouses
64 | df['family_size'] = df['parch'] + df['sibsp']
65 | df['is_alone'] = [
66 | 1 if family_size == 1 else 0 for family_size in df['family_size']
67 | ]
68 |
69 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
70 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
71 | df['title'] = [
72 | 'rare' if title in rare_titles else title for title in df['title']
73 | ]
74 |
75 | df = df[
76 | [
77 | 'pclass',
78 | 'sex',
79 | 'age',
80 | 'ticket',
81 | 'family_size',
82 | 'fare',
83 | 'embarked',
84 | 'is_alone',
85 | 'title',
86 | ]
87 | ]
88 |
89 | targets = [int(v) for v in targets['is_survived']]
90 | X_train, X_test, y_train, y_test = train_test_split(
91 | df, targets, stratify=targets, test_size=0.2
92 | )
93 |
94 | X_train_categorical = X_train[
95 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
96 | ]
97 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
98 |
99 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
100 | X_train_categorical
101 | )
102 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
103 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
104 |
105 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
106 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
107 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
108 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
109 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
110 |
111 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
112 | X_train_numerical_imputed_scaled = robust_scaler.transform(
113 | X_train_numerical_imputed
114 | )
115 | X_test_numerical_imputed_scaled = robust_scaler.transform(
116 | X_test_numerical_imputed
117 | )
118 |
119 | X_train_processed = np.hstack(
120 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
121 | )
122 | X_test_processed = np.hstack(
123 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
124 | )
125 |
126 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
127 | y_train_estimation = model.predict(X_train_processed)
128 | y_test_estimation = model.predict(X_test_processed)
129 |
130 | cm_train = confusion_matrix(y_train, y_train_estimation)
131 |
132 | cm_test = confusion_matrix(y_test, y_test_estimation)
133 |
134 | print('cm_train', cm_train)
135 | print('cm_test', cm_test)
136 |
137 | do_test('../data/cm_test.pkl', cm_test)
138 | do_test('../data/cm_train.pkl', cm_train)
139 | do_test('../data/X_train_processed.pkl', X_train_processed)
140 | do_test('../data/X_test_processed.pkl', X_test_processed)
141 |
142 | do_pandas_test('../data/df.pkl', df)
143 |
144 |
145 | def main(param: str = 'pass'):
146 | titanic_model_creator = TitanicModelCreator()
147 | titanic_model_creator.run()
148 |
149 |
150 | if __name__ == "__main__":
151 | typer.run(main)
152 |
--------------------------------------------------------------------------------
/Step06/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step07/README06.md:
--------------------------------------------------------------------------------
1 | ### Step 06: Decouple from the database
2 |
3 | - Create a loader property and argument in `TitanicModelCreator.__init__()`
4 | - Remove the database loader instantiation from the `run()` function
5 | - Update `TitanicModelCreator` construction to create the loader there
6 |
7 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing the ground for a test context that supports rapid iteration. Once you have created the adapter class, this change does the decoupling. It is an example of "Dependency Injection": a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction. As the saying goes: "Complex behaviour is constructed, not written." The closely related Dependency Inversion Principle is the `D` in the famed `SOLID` principles, and arguably the most important.
8 |
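9 | A minimal sketch of the injection idea. The `CsvLoader` below (and its filenames) is hypothetical, not part of this repo; it stands in for any alternative data source with the same interface:
10 | 
11 | ```python
12 | import pandas as pd
13 | 
14 | from titanic_model import TitanicModelCreator, SqlLoader
15 | 
16 | 
17 | class CsvLoader:
18 |     """Hypothetical loader with the same interface as SqlLoader."""
19 |     def __init__(self, passengers_filename, targets_filename):
20 |         self.passengers_filename = passengers_filename
21 |         self.targets_filename = targets_filename
22 | 
23 |     def get_passengers(self):
24 |         return pd.read_csv(self.passengers_filename)
25 | 
26 |     def get_targets(self):
27 |         return pd.read_csv(self.targets_filename)
28 | 
29 | 
30 | # Same class, two behaviours, chosen purely at construction time:
31 | TitanicModelCreator(loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')).run()
32 | TitanicModelCreator(loader=CsvLoader('../data/passengers.csv', '../data/targets.csv')).run()
33 | ```
34 | 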
--------------------------------------------------------------------------------
/Step07/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step07/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step07/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TitanicModelCreator:
54 | def __init__(self, loader):
55 | self.loader = loader
56 | np.random.seed(42)
57 |
58 | def run(self):
59 | df = self.loader.get_passengers()
60 | targets = self.loader.get_targets()
61 |
62 | # parch = Parents/Children, sibsp = Siblings/Spouses
63 | df['family_size'] = df['parch'] + df['sibsp']
64 | df['is_alone'] = [
65 | 1 if family_size == 1 else 0 for family_size in df['family_size']
66 | ]
67 |
68 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
69 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
70 | df['title'] = [
71 | 'rare' if title in rare_titles else title for title in df['title']
72 | ]
73 |
74 | df = df[
75 | [
76 | 'pclass',
77 | 'sex',
78 | 'age',
79 | 'ticket',
80 | 'family_size',
81 | 'fare',
82 | 'embarked',
83 | 'is_alone',
84 | 'title',
85 | ]
86 | ]
87 |
88 | targets = [int(v) for v in targets['is_survived']]
89 | X_train, X_test, y_train, y_test = train_test_split(
90 | df, targets, stratify=targets, test_size=0.2
91 | )
92 |
93 | X_train_categorical = X_train[
94 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
95 | ]
96 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
97 |
98 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
99 | X_train_categorical
100 | )
101 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
102 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
103 |
104 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
105 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
106 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
107 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
108 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
109 |
110 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
111 | X_train_numerical_imputed_scaled = robust_scaler.transform(
112 | X_train_numerical_imputed
113 | )
114 | X_test_numerical_imputed_scaled = robust_scaler.transform(
115 | X_test_numerical_imputed
116 | )
117 |
118 | X_train_processed = np.hstack(
119 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
120 | )
121 | X_test_processed = np.hstack(
122 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
123 | )
124 |
125 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
126 | y_train_estimation = model.predict(X_train_processed)
127 | y_test_estimation = model.predict(X_test_processed)
128 |
129 | cm_train = confusion_matrix(y_train, y_train_estimation)
130 |
131 | cm_test = confusion_matrix(y_test, y_test_estimation)
132 |
133 | print('cm_train', cm_train)
134 | print('cm_test', cm_test)
135 |
136 | do_test('../data/cm_test.pkl', cm_test)
137 | do_test('../data/cm_train.pkl', cm_train)
138 | do_test('../data/X_train_processed.pkl', X_train_processed)
139 | do_test('../data/X_test_processed.pkl', X_test_processed)
140 |
141 | do_pandas_test('../data/df.pkl', df)
142 |
143 |
144 | def main(param: str = 'pass'):
145 | titanic_model_creator = TitanicModelCreator(
146 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
147 | )
148 | titanic_model_creator.run()
149 |
150 |
151 | if __name__ == "__main__":
152 | typer.run(main)
153 |
--------------------------------------------------------------------------------
/Step07/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step08/README07.md:
--------------------------------------------------------------------------------
1 | ### Step 07: Write testing dataloader
2 |
3 | - Write a class that loads the required data from files
4 | - Same interface as `SqlLoader`
5 | - Add a "real" loader to it as a property
6 |
7 | This will allow the test context to work without a DB connection while still having the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable because they have the same interface.
8 |
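9 | Python's duck typing keeps "same interface" implicit. If you want to state it explicitly, a `typing.Protocol` can document what `TitanicModelCreator` expects of any loader; the `PassengerSource` name below is ours, not part of the repo:
10 | 
11 | ```python
12 | from typing import Protocol
13 | 
14 | import pandas as pd
15 | 
16 | from titanic_model import TitanicModelCreator
17 | 
18 | 
19 | class PassengerSource(Protocol):
20 |     """The methods TitanicModelCreator actually calls on its loader."""
21 |     def get_passengers(self) -> pd.DataFrame: ...
22 |     def get_targets(self) -> pd.DataFrame: ...
23 | 
24 | 
25 | # Both SqlLoader and TestLoader satisfy this structurally: no common
26 | # base class or inheritance is needed, only matching method signatures.
27 | def make_creator(loader: PassengerSource) -> TitanicModelCreator:
28 |     return TitanicModelCreator(loader=loader)
29 | ```
30 | 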
--------------------------------------------------------------------------------
/Step08/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step08/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step08/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TestLoader:
54 | def __init__(self, passengers_filename, targets_filename, real_loader):
55 | self.passengers_filename = passengers_filename
56 | self.targets_filename = targets_filename
57 | self.real_loader = real_loader
58 | if not os.path.isfile(self.passengers_filename):
59 | df = self.real_loader.get_passengers()
60 | df.to_pickle(self.passengers_filename)
61 | if not os.path.isfile(self.targets_filename):
62 | df = self.real_loader.get_targets()
63 | df.to_pickle(self.targets_filename)
64 |
65 | def get_passengers(self):
66 | return pd.read_pickle(self.passengers_filename)
67 |
68 | def get_targets(self):
69 | return pd.read_pickle(self.targets_filename)
70 |
71 |
72 | class TitanicModelCreator:
73 | def __init__(self, loader):
74 | self.loader = loader
75 | np.random.seed(42)
76 |
77 | def run(self):
78 | df = self.loader.get_passengers()
79 | targets = self.loader.get_targets()
80 |
81 | # parch = Parents/Children, sibsp = Siblings/Spouses
82 | df['family_size'] = df['parch'] + df['sibsp']
83 | df['is_alone'] = [
84 | 1 if family_size == 1 else 0 for family_size in df['family_size']
85 | ]
86 |
87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
89 | df['title'] = [
90 | 'rare' if title in rare_titles else title for title in df['title']
91 | ]
92 |
93 | df = df[
94 | [
95 | 'pclass',
96 | 'sex',
97 | 'age',
98 | 'ticket',
99 | 'family_size',
100 | 'fare',
101 | 'embarked',
102 | 'is_alone',
103 | 'title',
104 | ]
105 | ]
106 |
107 | targets = [int(v) for v in targets['is_survived']]
108 | X_train, X_test, y_train, y_test = train_test_split(
109 | df, targets, stratify=targets, test_size=0.2
110 | )
111 |
112 | X_train_categorical = X_train[
113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
114 | ]
115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
116 |
117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
118 | X_train_categorical
119 | )
120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
122 |
123 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
124 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
128 |
129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
130 | X_train_numerical_imputed_scaled = robust_scaler.transform(
131 | X_train_numerical_imputed
132 | )
133 | X_test_numerical_imputed_scaled = robust_scaler.transform(
134 | X_test_numerical_imputed
135 | )
136 |
137 | X_train_processed = np.hstack(
138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
139 | )
140 | X_test_processed = np.hstack(
141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
142 | )
143 |
144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
145 | y_train_estimation = model.predict(X_train_processed)
146 | y_test_estimation = model.predict(X_test_processed)
147 |
148 | cm_train = confusion_matrix(y_train, y_train_estimation)
149 |
150 | cm_test = confusion_matrix(y_test, y_test_estimation)
151 |
152 | print('cm_train', cm_train)
153 | print('cm_test', cm_test)
154 |
155 | do_test('../data/cm_test.pkl', cm_test)
156 | do_test('../data/cm_train.pkl', cm_train)
157 | do_test('../data/X_train_processed.pkl', X_train_processed)
158 | do_test('../data/X_test_processed.pkl', X_test_processed)
159 |
160 | do_pandas_test('../data/df.pkl', df)
161 |
162 |
163 | def main(param: str = 'pass'):
164 | titanic_model_creator = TitanicModelCreator(
165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
166 | )
167 | titanic_model_creator.run()
168 |
169 |
170 | if __name__ == "__main__":
171 | typer.run(main)
172 |
--------------------------------------------------------------------------------
/Step08/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step09/README08.md:
--------------------------------------------------------------------------------
1 | ### Step 08: Write the test context
2 |
3 | - Create `test_main` function
4 | - Make sure `typer` calls that in `typer.run()`
5 | - Copy the code from `main` to `test_main`
6 | - Replace `SqlLoader` in it with `TestLoader`
7 |
8 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. If that is still not fast enough, an additional parameter can reduce the amount of data used in the test to make the process faster (a sketch follows below). [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo]
9 |
10 | This is the essence of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places, no time is spent translating from one to the other. The test setup should reflect the production context as closely as possible, so that when a test fails or passes you can expect the same to happen in production. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" once you are convinced it is doing what you think it should do. And because it is the same code, deployment is effortless.
11 |
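12 | A sketch of what that extra parameter could look like. The `max_rows` argument is not in the repo, and the sketch is abbreviated to the passengers side of the loader:
13 | 
14 | ```python
15 | import os
16 | 
17 | import pandas as pd
18 | 
19 | 
20 | class SampledTestLoader:
21 |     """Like TestLoader, plus an optional row cap to speed tests up."""
22 |     def __init__(self, passengers_filename, real_loader, max_rows=None):
23 |         self.passengers_filename = passengers_filename
24 |         self.real_loader = real_loader
25 |         self.max_rows = max_rows
26 |         # Fall back to the real (DB) loader only on the first run
27 |         if not os.path.isfile(self.passengers_filename):
28 |             self.real_loader.get_passengers().to_pickle(self.passengers_filename)
29 | 
30 |     def get_passengers(self):
31 |         df = pd.read_pickle(self.passengers_filename)
32 |         # A smaller frame makes every downstream step faster
33 |         return df if self.max_rows is None else df.head(self.max_rows)
34 | ```
35 | 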
--------------------------------------------------------------------------------
/Step09/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step09/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step09/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = 'SELECT * FROM tbl_passengers'
46 | return pd.read_sql(query, con=self.connection)
47 |
48 | def get_targets(self):
49 | query = 'SELECT * FROM tbl_targets'
50 | return pd.read_sql(query, con=self.connection)
51 |
52 |
53 | class TestLoader:
54 | def __init__(self, passengers_filename, targets_filename, real_loader):
55 | self.passengers_filename = passengers_filename
56 | self.targets_filename = targets_filename
57 | self.real_loader = real_loader
58 | if not os.path.isfile(self.passengers_filename):
59 | df = self.real_loader.get_passengers()
60 | df.to_pickle(self.passengers_filename)
61 | if not os.path.isfile(self.targets_filename):
62 | df = self.real_loader.get_targets()
63 | df.to_pickle(self.targets_filename)
64 |
65 | def get_passengers(self):
66 | return pd.read_pickle(self.passengers_filename)
67 |
68 | def get_targets(self):
69 | return pd.read_pickle(self.targets_filename)
70 |
71 |
72 | class TitanicModelCreator:
73 | def __init__(self, loader):
74 | self.loader = loader
75 | np.random.seed(42)
76 |
77 | def run(self):
78 | df = self.loader.get_passengers()
79 | targets = self.loader.get_targets()
80 |
81 | # parch = Parents/Children, sibsp = Siblings/Spouses
82 | df['family_size'] = df['parch'] + df['sibsp']
83 | df['is_alone'] = [
84 | 1 if family_size == 1 else 0 for family_size in df['family_size']
85 | ]
86 |
87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
89 | df['title'] = [
90 | 'rare' if title in rare_titles else title for title in df['title']
91 | ]
92 |
93 | df = df[
94 | [
95 | 'pclass',
96 | 'sex',
97 | 'age',
98 | 'ticket',
99 | 'family_size',
100 | 'fare',
101 | 'embarked',
102 | 'is_alone',
103 | 'title',
104 | ]
105 | ]
106 |
107 | targets = [int(v) for v in targets['is_survived']]
108 | X_train, X_test, y_train, y_test = train_test_split(
109 | df, targets, stratify=targets, test_size=0.2
110 | )
111 |
112 | X_train_categorical = X_train[
113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
114 | ]
115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
116 |
117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
118 | X_train_categorical
119 | )
120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
122 |
123 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
124 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
128 |
129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
130 | X_train_numerical_imputed_scaled = robust_scaler.transform(
131 | X_train_numerical_imputed
132 | )
133 | X_test_numerical_imputed_scaled = robust_scaler.transform(
134 | X_test_numerical_imputed
135 | )
136 |
137 | X_train_processed = np.hstack(
138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
139 | )
140 | X_test_processed = np.hstack(
141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
142 | )
143 |
144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
145 | y_train_estimation = model.predict(X_train_processed)
146 | y_test_estimation = model.predict(X_test_processed)
147 |
148 | cm_train = confusion_matrix(y_train, y_train_estimation)
149 |
150 | cm_test = confusion_matrix(y_test, y_test_estimation)
151 |
152 | print('cm_train', cm_train)
153 | print('cm_test', cm_test)
154 |
155 | do_test('../data/cm_test.pkl', cm_test)
156 | do_test('../data/cm_train.pkl', cm_train)
157 | do_test('../data/X_train_processed.pkl', X_train_processed)
158 | do_test('../data/X_test_processed.pkl', X_test_processed)
159 |
160 | do_pandas_test('../data/df.pkl', df)
161 |
162 |
163 | def main(param: str = 'pass'):
164 | titanic_model_creator = TitanicModelCreator(
165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
166 | )
167 | titanic_model_creator.run()
168 |
169 |
170 | def test_main(param: str = 'pass'):
171 | titanic_model_creator = TitanicModelCreator(
172 | loader=TestLoader(
173 | passengers_filename='../data/passengers.pkl',
174 | targets_filename='../data/targets.pkl',
175 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
176 | )
177 | )
178 | titanic_model_creator.run()
179 |
180 |
181 | if __name__ == "__main__":
182 | typer.run(test_main)
183 |
--------------------------------------------------------------------------------
/Step09/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step10/README09.md:
--------------------------------------------------------------------------------
1 | ### Step 09: Merge passenger data with targets
2 |
3 | - Remove the `get_targets()` interface
4 | - Replace the query in `SqlLoader`
5 | - Remove any code related to `targets`
6 |
7 | This step prepares for building the "domain data model". The Titanic model is about the survival of the ship's passengers. For the code to align with this domain, the concept of a "passenger" needs to be introduced (as a class/object). A passenger either survived or not; survival is an attribute of the passenger and needs to be implemented as such.
8 |
9 | This is a critical part of the code quality journey and of building better systems. Once you introduce these concepts, your code will depend directly on the business problem you are solving, not on the various representations in which the data is stored (pandas, numpy, csv, etc.). A sketch of where this is heading follows the links below. I wrote about this many times on my blog:
10 |
11 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data)
12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
13 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses)
14 |
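15 | To make the point concrete, this is roughly the shape the next steps introduce: survival lives on the passenger object itself rather than in a separate `targets` structure. A preview, not the final class:
16 | 
17 | ```python
18 | from pydantic import BaseModel
19 | 
20 | 
21 | class Passenger(BaseModel):
22 |     pid: int
23 |     is_survived: int  # survival is an attribute of the passenger
24 | 
25 | 
26 | # Targets are then derived from the domain objects instead of being
27 | # loaded separately:
28 | # targets = [passenger.is_survived for passenger in passengers]
29 | ```
30 | 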
--------------------------------------------------------------------------------
/Step10/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step10/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step10/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from collections import Counter
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | def do_test(filename, data):
18 | if not os.path.isfile(filename):
19 | pickle.dump(data, open(filename, 'wb'))
20 | truth = pickle.load(open(filename, 'rb'))
21 | try:
22 | np.testing.assert_almost_equal(data, truth)
23 | print(f'{filename} test passed')
24 | except AssertionError as ex:
25 | print(f'{filename} test failed {ex}')
26 |
27 |
28 | def do_pandas_test(filename, data):
29 | if not os.path.isfile(filename):
30 | data.to_pickle(filename)
31 | truth = pd.read_pickle(filename)
32 | try:
33 | pd.testing.assert_frame_equal(data, truth)
34 | print(f'{filename} pandas test passed')
35 | except AssertionError as ex:
36 | print(f'{filename} pandas test failed {ex}')
37 |
38 |
39 | class SqlLoader:
40 | def __init__(self, connection_string):
41 | engine = create_engine(connection_string)
42 | self.connection = engine.connect()
43 |
44 | def get_passengers(self):
45 | query = """
46 | SELECT
47 | tbl_passengers.*,
48 | tbl_targets.is_survived
49 | FROM
50 | tbl_passengers
51 | JOIN
52 | tbl_targets
53 | ON
54 | tbl_passengers.pid=tbl_targets.pid
55 | """
56 | return pd.read_sql(query, con=self.connection)
57 |
58 |
59 | class TestLoader:
60 | def __init__(self, passengers_filename, real_loader):
61 | self.passengers_filename = passengers_filename
62 | self.real_loader = real_loader
63 | if not os.path.isfile(self.passengers_filename):
64 | df = self.real_loader.get_passengers()
65 | df.to_pickle(self.passengers_filename)
66 |
67 | def get_passengers(self):
68 | return pd.read_pickle(self.passengers_filename)
69 |
70 |
71 | class TitanicModelCreator:
72 | def __init__(self, loader):
73 | self.loader = loader
74 | np.random.seed(42)
75 |
76 | def run(self):
77 | df = self.loader.get_passengers()
78 |
79 | # parch = Parents/Children, sibsp = Siblings/Spouses
80 | df['family_size'] = df['parch'] + df['sibsp']
81 | df['is_alone'] = [
82 | 1 if family_size == 1 else 0 for family_size in df['family_size']
83 | ]
84 |
85 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
86 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
87 | df['title'] = [
88 | 'rare' if title in rare_titles else title for title in df['title']
89 | ]
90 |
91 | targets = [int(v) for v in df['is_survived']]
92 | df = df[
93 | [
94 | 'pclass',
95 | 'sex',
96 | 'age',
97 | 'ticket',
98 | 'family_size',
99 | 'fare',
100 | 'embarked',
101 | 'is_alone',
102 | 'title',
103 | ]
104 | ]
105 |
106 | X_train, X_test, y_train, y_test = train_test_split(
107 | df, targets, stratify=targets, test_size=0.2
108 | )
109 |
110 | X_train_categorical = X_train[
111 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
112 | ]
113 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
114 |
115 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
116 | X_train_categorical
117 | )
118 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
119 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
120 |
121 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
122 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
123 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
124 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
125 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
126 |
127 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
128 | X_train_numerical_imputed_scaled = robust_scaler.transform(
129 | X_train_numerical_imputed
130 | )
131 | X_test_numerical_imputed_scaled = robust_scaler.transform(
132 | X_test_numerical_imputed
133 | )
134 |
135 | X_train_processed = np.hstack(
136 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
137 | )
138 | X_test_processed = np.hstack(
139 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
140 | )
141 |
142 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
143 | y_train_estimation = model.predict(X_train_processed)
144 | y_test_estimation = model.predict(X_test_processed)
145 |
146 | cm_train = confusion_matrix(y_train, y_train_estimation)
147 |
148 | cm_test = confusion_matrix(y_test, y_test_estimation)
149 |
150 | print('cm_train', cm_train)
151 | print('cm_test', cm_test)
152 |
153 | do_test('../data/cm_test.pkl', cm_test)
154 | do_test('../data/cm_train.pkl', cm_train)
155 | do_test('../data/X_train_processed.pkl', X_train_processed)
156 | do_test('../data/X_test_processed.pkl', X_test_processed)
157 |
158 | do_pandas_test('../data/df.pkl', df)
159 |
160 |
161 | def main(param: str = 'pass'):
162 | titanic_model_creator = TitanicModelCreator(
163 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
164 | )
165 | titanic_model_creator.run()
166 |
167 |
168 | def test_main(param: str = 'pass'):
169 | titanic_model_creator = TitanicModelCreator(
170 | loader=TestLoader(
171 | passengers_filename='../data/passengers_with_is_survived.pkl',
172 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
173 | )
174 | )
175 | titanic_model_creator.run()
176 |
177 |
178 | if __name__ == "__main__":
179 | typer.run(test_main)
180 |
--------------------------------------------------------------------------------
/Step10/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step11/README10.md:
--------------------------------------------------------------------------------
1 | ### Step 10: Create Passenger class
2 |
3 | - Import `BaseModel` from `pydantic`
4 | - Create the class by inspecting:
5 | - The `dtype` of columns used in `df`
6 | - The actual values in `df`
7 | - The names of the columns that are used later in the code
8 |
9 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you have to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic`'s validators, as sketched below. One goal of the refactoring is to make sure that most interaction between classes happens through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen.
10 |
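11 | A minimal sketch of such a validator, using pydantic's v1-style `@validator` (the repo does not pin a pydantic version; v2 renames this to `field_validator`). The allowed values come from the `df` inspection shown in the code comments:
12 | 
13 | ```python
14 | from pydantic import BaseModel, validator
15 | 
16 | 
17 | class Passenger(BaseModel):
18 |     pid: int
19 |     pclass: int
20 |     sex: str
21 |     age: float
22 | 
23 |     @validator('pclass')
24 |     def pclass_is_valid(cls, value):
25 |         # set(df['pclass']) == {1.0, 2.0, 3.0}, made explicit here
26 |         if value not in (1, 2, 3):
27 |             raise ValueError(f'unexpected pclass: {value}')
28 |         return value
29 | 
30 |     @validator('sex')
31 |     def sex_is_known(cls, value):
32 |         if value not in ('male', 'female'):
33 |             raise ValueError(f'unexpected sex: {value}')
34 |         return value
35 | ```
36 | 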
--------------------------------------------------------------------------------
/Step11/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step11/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step11/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | ticket: str
24 | family_size: int
25 | fare: float
26 | embarked: str
27 | is_alone: int
28 | title: str
29 | is_survived: int
30 |
31 |
32 | # targets = [int(v) for v in df['is_survived']]
33 | # df = df[[
34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size',
35 | # 'fare', 'embarked', 'is_alone', 'title',
36 | # ]]
37 |
38 | # >>> df[:3].T
39 | # 0 1 2
40 | # pid 0 1 2
41 | # pclass 1.0 1.0 1.0
42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss. Helen Loraine
43 | # sex female male female
44 | # age 29.0 0.9167 2.0
45 | # sibsp 0.0 1.0 1.0
46 | # parch 0.0 2.0 2.0
47 | # ticket 24160 113781 113781
48 | # fare 211.3375 151.55 151.55
49 | # cabin B5 C22 C26 C22 C26
50 | # embarked S S S
51 | # boat 2 11 None
52 | # body NaN NaN NaN
53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON
54 | # is_survived 1 1 0
55 | # >>> df.dtypes
56 | # pid int64
57 | # pclass float64
58 | # name object
59 | # sex object
60 | # age float64
61 | # sibsp float64
62 | # parch float64
63 | # ticket object
64 | # fare float64
65 | # cabin object
66 | # embarked object
67 | # boat object
68 | # body float64
69 | # home.dest object
70 | # is_survived int64
71 | # >>> set(df['pclass'])
72 | # {1.0, 2.0, 3.0}
73 |
74 |
75 | def do_test(filename, data):
76 | if not os.path.isfile(filename):
77 | pickle.dump(data, open(filename, 'wb'))
78 | truth = pickle.load(open(filename, 'rb'))
79 | try:
80 | np.testing.assert_almost_equal(data, truth)
81 | print(f'{filename} test passed')
82 | except AssertionError as ex:
83 | print(f'{filename} test failed {ex}')
84 |
85 |
86 | def do_pandas_test(filename, data):
87 | if not os.path.isfile(filename):
88 | data.to_pickle(filename)
89 | truth = pd.read_pickle(filename)
90 | try:
91 | pd.testing.assert_frame_equal(data, truth)
92 | print(f'{filename} pandas test passed')
93 | except AssertionError as ex:
94 | print(f'{filename} pandas test failed {ex}')
95 |
96 |
97 | class SqlLoader:
98 | def __init__(self, connection_string):
99 | engine = create_engine(connection_string)
100 | self.connection = engine.connect()
101 |
102 | def get_passengers(self):
103 | query = """
104 | SELECT
105 | tbl_passengers.*,
106 | tbl_targets.is_survived
107 | FROM
108 | tbl_passengers
109 | JOIN
110 | tbl_targets
111 | ON
112 | tbl_passengers.pid=tbl_targets.pid
113 | """
114 | return pd.read_sql(query, con=self.connection)
115 |
116 |
117 | class TestLoader:
118 | def __init__(self, passengers_filename, real_loader):
119 | self.passengers_filename = passengers_filename
120 | self.real_loader = real_loader
121 | if not os.path.isfile(self.passengers_filename):
122 | df = self.real_loader.get_passengers()
123 | df.to_pickle(self.passengers_filename)
124 |
125 | def get_passengers(self):
126 | return pd.read_pickle(self.passengers_filename)
127 |
128 |
129 | class TitanicModelCreator:
130 | def __init__(self, loader):
131 | self.loader = loader
132 | np.random.seed(42)
133 |
134 | def run(self):
135 | df = self.loader.get_passengers()
136 |
137 | # parch = Parents/Children, sibsp = Siblings/Spouses
138 | df['family_size'] = df['parch'] + df['sibsp']
139 | df['is_alone'] = [
140 | 1 if family_size == 1 else 0 for family_size in df['family_size']
141 | ]
142 |
143 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
144 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
145 | df['title'] = [
146 | 'rare' if title in rare_titles else title for title in df['title']
147 | ]
148 |
149 | targets = [int(v) for v in df['is_survived']]
150 | df = df[
151 | [
152 | 'pclass',
153 | 'sex',
154 | 'age',
155 | 'ticket',
156 | 'family_size',
157 | 'fare',
158 | 'embarked',
159 | 'is_alone',
160 | 'title',
161 | ]
162 | ]
163 |
164 | X_train, X_test, y_train, y_test = train_test_split(
165 | df, targets, stratify=targets, test_size=0.2
166 | )
167 |
168 | X_train_categorical = X_train[
169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
170 | ]
171 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
172 |
173 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
174 | X_train_categorical
175 | )
176 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
177 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
178 |
179 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
180 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
181 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
182 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
183 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
184 |
185 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
186 | X_train_numerical_imputed_scaled = robust_scaler.transform(
187 | X_train_numerical_imputed
188 | )
189 | X_test_numerical_imputed_scaled = robust_scaler.transform(
190 | X_test_numerical_imputed
191 | )
192 |
193 | X_train_processed = np.hstack(
194 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
195 | )
196 | X_test_processed = np.hstack(
197 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
198 | )
199 |
200 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
201 | y_train_estimation = model.predict(X_train_processed)
202 | y_test_estimation = model.predict(X_test_processed)
203 |
204 | cm_train = confusion_matrix(y_train, y_train_estimation)
205 |
206 | cm_test = confusion_matrix(y_test, y_test_estimation)
207 |
208 | print('cm_train', cm_train)
209 | print('cm_test', cm_test)
210 |
211 | do_test('../data/cm_test.pkl', cm_test)
212 | do_test('../data/cm_train.pkl', cm_train)
213 | do_test('../data/X_train_processed.pkl', X_train_processed)
214 | do_test('../data/X_test_processed.pkl', X_test_processed)
215 |
216 | do_pandas_test('../data/df.pkl', df)
217 |
218 |
219 | def main(param: str = 'pass'):
220 | titanic_model_creator = TitanicModelCreator(
221 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
222 | )
223 | titanic_model_creator.run()
224 |
225 |
226 | def test_main(param: str = 'pass'):
227 | titanic_model_creator = TitanicModelCreator(
228 | loader=TestLoader(
229 | passengers_filename='../data/passengers_with_is_survived.pkl',
230 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
231 | )
232 | )
233 | titanic_model_creator.run()
234 |
235 |
236 | if __name__ == "__main__":
237 | typer.run(test_main)
238 |
--------------------------------------------------------------------------------
/Step11/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step12/README11.md:
--------------------------------------------------------------------------------
1 | ### Step 11: Create domain data object based data loader
2 |
3 | - Create `PassengerLoader` class that takes a "real"/"old" loader
4 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects
5 | - Copy the data transformations from `TitanicModelCreator.run()`
6 |
7 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear fewer than 10 times are selected. This can only be done if you have access to the entire dataset, and the resulting list needs to be maintained. That can cause problems in a real setting where such a scan is too difficult, for example if you have millions of items or a constant stream. These kinds of dependencies are common in legacy code, and one of the goals of refactoring is to identify them and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service.
8 |
9 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other objects; they are a type of adapter that hides where the data comes from and how it is stored, returning only abstract, domain-relevant objects that you can use downstream. A sketch of the constant-based construction follows the links below. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows:
10 |
11 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to)
12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
13 |
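14 | A sketch of the constant-based construction mentioned above. The `RARE_TITLES` values are illustrative only; derive them once, offline, with the same `Counter` logic that `run()` uses inline:
15 | 
16 | ```python
17 | from collections import Counter
18 | 
19 | from titanic_model import PassengerLoader, SqlLoader
20 | 
21 | 
22 | def derive_rare_titles(titles, min_count=10):
23 |     """One-off, offline derivation of the maintained constant."""
24 |     return {title for title, count in Counter(titles).items() if count < min_count}
25 | 
26 | 
27 | # Frozen result of the derivation above (values illustrative only):
28 | RARE_TITLES = {'Dr', 'Rev', 'Col', 'Major', 'Ms'}
29 | 
30 | passengers = PassengerLoader(
31 |     loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
32 |     rare_titles=RARE_TITLES,
33 | ).get_passengers()
34 | ```
35 | 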
--------------------------------------------------------------------------------
/Step12/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step12/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step12/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | ticket: str
24 | family_size: int
25 | fare: float
26 | embarked: str
27 | is_alone: int
28 | title: str
29 | is_survived: int
30 |
31 |
32 | # targets = [int(v) for v in df['is_survived']]
33 | # df = df[[
34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size',
35 | # 'fare', 'embarked', 'is_alone', 'title',
36 | # ]]
37 |
38 | # >>> df[:3].T
39 | # 0 1 2
40 | # pid 0 1 2
41 | # pclass 1.0 1.0 1.0
42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss. Helen Loraine
43 | # sex female male female
44 | # age 29.0 0.9167 2.0
45 | # sibsp 0.0 1.0 1.0
46 | # parch 0.0 2.0 2.0
47 | # ticket 24160 113781 113781
48 | # fare 211.3375 151.55 151.55
49 | # cabin B5 C22 C26 C22 C26
50 | # embarked S S S
51 | # boat 2 11 None
52 | # body NaN NaN NaN
53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON
54 | # is_survived 1 1 0
55 | # >>> df.dtypes
56 | # pid int64
57 | # pclass float64
58 | # name object
59 | # sex object
60 | # age float64
61 | # sibsp float64
62 | # parch float64
63 | # ticket object
64 | # fare float64
65 | # cabin object
66 | # embarked object
67 | # boat object
68 | # body float64
69 | # home.dest object
70 | # is_survived int64
71 | # >>> set(df['pclass'])
72 | # {1.0, 2.0, 3.0}
73 |
74 |
75 | def do_test(filename, data):
76 | if not os.path.isfile(filename):
77 | pickle.dump(data, open(filename, 'wb'))
78 | truth = pickle.load(open(filename, 'rb'))
79 | try:
80 | np.testing.assert_almost_equal(data, truth)
81 | print(f'{filename} test passed')
82 | except AssertionError as ex:
83 | print(f'{filename} test failed {ex}')
84 |
85 |
86 | def do_pandas_test(filename, data):
87 | if not os.path.isfile(filename):
88 | data.to_pickle(filename)
89 | truth = pd.read_pickle(filename)
90 | try:
91 | pd.testing.assert_frame_equal(data, truth)
92 | print(f'{filename} pandas test passed')
93 | except AssertionError as ex:
94 | print(f'{filename} pandas test failed {ex}')
95 |
96 |
97 | class SqlLoader:
98 | def __init__(self, connection_string):
99 | engine = create_engine(connection_string)
100 | self.connection = engine.connect()
101 |
102 | def get_passengers(self):
103 | query = """
104 | SELECT
105 | tbl_passengers.*,
106 | tbl_targets.is_survived
107 | FROM
108 | tbl_passengers
109 | JOIN
110 | tbl_targets
111 | ON
112 | tbl_passengers.pid=tbl_targets.pid
113 | """
114 | return pd.read_sql(query, con=self.connection)
115 |
116 |
117 | class TestLoader:
118 | def __init__(self, passengers_filename, real_loader):
119 | self.passengers_filename = passengers_filename
120 | self.real_loader = real_loader
121 | if not os.path.isfile(self.passengers_filename):
122 | df = self.real_loader.get_passengers()
123 | df.to_pickle(self.passengers_filename)
124 |
125 | def get_passengers(self):
126 | return pd.read_pickle(self.passengers_filename)
127 |
128 |
129 | class PassengerLoader:
130 | def __init__(self, loader, rare_titles=None):
131 | self.loader = loader
132 | self.rare_titles = rare_titles
133 |
134 | def get_passengers(self):
135 | passengers = []
136 | for data in self.loader.get_passengers().itertuples():
137 | # parch = Parents/Children, sibsp = Siblings/Spouses
138 | family_size = int(data.parch + data.sibsp)
139 | # Allen, Miss. Elisabeth Walton
140 | title = data.name.split(',')[1].split('.')[0].strip()
141 | passenger = Passenger(
142 | pid=int(data.pid),
143 | pclass=int(data.pclass),
144 | sex=str(data.sex),
145 | age=float(data.age),
146 | ticket=str(data.ticket),
147 | family_size=family_size,
148 | fare=float(data.fare),
149 | embarked=str(data.embarked),
150 | is_alone=1 if family_size == 1 else 0,
151 | title='rare' if title in self.rare_titles else title,
152 | is_survived=int(data.is_survived),
153 | )
154 | passengers.append(passenger)
155 | return passengers
156 |
157 |
158 | # Not used:
159 | # cabin object
160 | # boat object
161 | # body float64
162 | # home.dest object
163 |
164 |
165 | class TitanicModelCreator:
166 | def __init__(self, loader):
167 | self.loader = loader
168 | np.random.seed(42)
169 |
170 | def run(self):
171 | df = self.loader.get_passengers()
172 |
173 | # parch = Parents/Children, sibsp = Siblings/Spouses
174 | df['family_size'] = df['parch'] + df['sibsp']
175 | df['is_alone'] = [
176 | 1 if family_size == 1 else 0 for family_size in df['family_size']
177 | ]
178 |
179 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
180 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
181 | df['title'] = [
182 | 'rare' if title in rare_titles else title for title in df['title']
183 | ]
184 |
185 | targets = [int(v) for v in df['is_survived']]
186 | df = df[
187 | [
188 | 'pclass',
189 | 'sex',
190 | 'age',
191 | 'ticket',
192 | 'family_size',
193 | 'fare',
194 | 'embarked',
195 | 'is_alone',
196 | 'title',
197 | ]
198 | ]
199 |
200 | X_train, X_test, y_train, y_test = train_test_split(
201 | df, targets, stratify=targets, test_size=0.2
202 | )
203 |
204 | X_train_categorical = X_train[
205 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
206 | ]
207 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
208 |
209 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
210 | X_train_categorical
211 | )
212 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
213 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
214 |
215 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
216 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
217 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
218 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
219 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
220 |
221 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
222 | X_train_numerical_imputed_scaled = robust_scaler.transform(
223 | X_train_numerical_imputed
224 | )
225 | X_test_numerical_imputed_scaled = robust_scaler.transform(
226 | X_test_numerical_imputed
227 | )
228 |
229 | X_train_processed = np.hstack(
230 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
231 | )
232 | X_test_processed = np.hstack(
233 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
234 | )
235 |
236 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
237 | y_train_estimation = model.predict(X_train_processed)
238 | y_test_estimation = model.predict(X_test_processed)
239 |
240 | cm_train = confusion_matrix(y_train, y_train_estimation)
241 |
242 | cm_test = confusion_matrix(y_test, y_test_estimation)
243 |
244 | print('cm_train', cm_train)
245 | print('cm_test', cm_test)
246 |
247 | do_test('../data/cm_test.pkl', cm_test)
248 | do_test('../data/cm_train.pkl', cm_train)
249 | do_test('../data/X_train_processed.pkl', X_train_processed)
250 | do_test('../data/X_test_processed.pkl', X_test_processed)
251 |
252 | do_pandas_test('../data/df.pkl', df)
253 |
254 |
255 | def main(param: str = 'pass'):
256 | titanic_model_creator = TitanicModelCreator(
257 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
258 | )
259 | titanic_model_creator.run()
260 |
261 |
262 | def test_main(param: str = 'pass'):
263 | titanic_model_creator = TitanicModelCreator(
264 | loader=TestLoader(
265 | passengers_filename='../data/passengers_with_is_survived.pkl',
266 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
267 | )
268 | )
269 | titanic_model_creator.run()
270 |
271 |
272 | if __name__ == "__main__":
273 | typer.run(test_main)
274 |
--------------------------------------------------------------------------------
/Step12/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step13/README12.md:
--------------------------------------------------------------------------------
1 | ### Step 12: Remove any data that is not explicitly needed
2 |
3 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input
4 |
5 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed; if the requirements change, it can be added back again. For example, the `ticket` column is in `df` but it is never used again in the program. Remove it.
6 |
--------------------------------------------------------------------------------
/Step13/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step13/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step13/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from collections import Counter
8 | from sqlalchemy import create_engine
9 |
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.preprocessing import RobustScaler
13 | from sklearn.preprocessing import OneHotEncoder
14 | from sklearn.impute import KNNImputer
15 | from sklearn.metrics import confusion_matrix
16 |
17 |
18 | class Passenger(BaseModel):
19 | pid: int
20 | pclass: int
21 | sex: str
22 | age: float
23 | family_size: int
24 | fare: float
25 | embarked: str
26 | is_alone: int
27 | title: str
28 | is_survived: int
29 |
30 |
31 | def do_test(filename, data):
32 | if not os.path.isfile(filename):
33 | pickle.dump(data, open(filename, 'wb'))
34 | truth = pickle.load(open(filename, 'rb'))
35 | try:
36 | np.testing.assert_almost_equal(data, truth)
37 | print(f'{filename} test passed')
38 | except AssertionError as ex:
39 | print(f'{filename} test failed {ex}')
40 |
41 |
42 | def do_pandas_test(filename, data):
43 | if not os.path.isfile(filename):
44 | data.to_pickle(filename)
45 | truth = pd.read_pickle(filename)
46 | try:
47 | pd.testing.assert_frame_equal(data, truth)
48 | print(f'{filename} pandas test passed')
49 | except AssertionError as ex:
50 | print(f'{filename} pandas test failed {ex}')
51 |
52 |
53 | class SqlLoader:
54 | def __init__(self, connection_string):
55 | engine = create_engine(connection_string)
56 | self.connection = engine.connect()
57 |
58 | def get_passengers(self):
59 | query = """
60 | SELECT
61 | tbl_passengers.pid,
62 | tbl_passengers.pclass,
63 | tbl_passengers.sex,
64 | tbl_passengers.age,
65 | tbl_passengers.parch,
66 | tbl_passengers.sibsp,
67 | tbl_passengers.fare,
68 | tbl_passengers.embarked,
69 | tbl_passengers.name,
70 | tbl_targets.is_survived
71 | FROM
72 | tbl_passengers
73 | JOIN
74 | tbl_targets
75 | ON
76 | tbl_passengers.pid=tbl_targets.pid
77 | """
78 | return pd.read_sql(query, con=self.connection)
79 |
80 |
81 | class TestLoader:
82 | def __init__(self, passengers_filename, real_loader):
83 | self.passengers_filename = passengers_filename
84 | self.real_loader = real_loader
85 | if not os.path.isfile(self.passengers_filename):
86 | df = self.real_loader.get_passengers()
87 | df.to_pickle(self.passengers_filename)
88 |
89 | def get_passengers(self):
90 | return pd.read_pickle(self.passengers_filename)
91 |
92 |
93 | class PassengerLoader:
94 | def __init__(self, loader, rare_titles=None):
95 | self.loader = loader
96 | self.rare_titles = rare_titles
97 |
98 | def get_passengers(self):
99 | passengers = []
100 | for data in self.loader.get_passengers().itertuples():
101 | # parch = Parents/Children, sibsp = Siblings/Spouses
102 | family_size = int(data.parch + data.sibsp)
103 | # Allen, Miss. Elisabeth Walton
104 | title = data.name.split(',')[1].split('.')[0].strip()
105 | passenger = Passenger(
106 | pid=int(data.pid),
107 | pclass=int(data.pclass),
108 | sex=str(data.sex),
109 | age=float(data.age),
110 | family_size=family_size,
111 | fare=float(data.fare),
112 | embarked=str(data.embarked),
113 | is_alone=1 if family_size == 1 else 0,
114 | title='rare' if title in self.rare_titles else title,
115 | is_survived=int(data.is_survived),
116 | )
117 | passengers.append(passenger)
118 | return passengers
119 |
120 |
121 | class TitanicModelCreator:
122 | def __init__(self, loader):
123 | self.loader = loader
124 | np.random.seed(42)
125 |
126 | def run(self):
127 | df = self.loader.get_passengers()
128 |
129 | # parch = Parents/Children, sibsp = Siblings/Spouses
130 | df['family_size'] = df['parch'] + df['sibsp']
131 | df['is_alone'] = [
132 | 1 if family_size == 1 else 0 for family_size in df['family_size']
133 | ]
134 |
135 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]
136 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}
137 | df['title'] = [
138 | 'rare' if title in rare_titles else title for title in df['title']
139 | ]
140 |
141 | targets = [int(v) for v in df['is_survived']]
142 | df = df[
143 | [
144 | 'pclass',
145 | 'sex',
146 | 'age',
147 | 'family_size',
148 | 'fare',
149 | 'embarked',
150 | 'is_alone',
151 | 'title',
152 | ]
153 | ]
154 |
155 | X_train, X_test, y_train, y_test = train_test_split(
156 | df, targets, stratify=targets, test_size=0.2
157 | )
158 |
159 | X_train_categorical = X_train[
160 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
161 | ]
162 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
163 |
164 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
165 | X_train_categorical
166 | )
167 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
168 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
169 |
170 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
171 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
172 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
173 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
174 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
175 |
176 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
177 | X_train_numerical_imputed_scaled = robust_scaler.transform(
178 | X_train_numerical_imputed
179 | )
180 | X_test_numerical_imputed_scaled = robust_scaler.transform(
181 | X_test_numerical_imputed
182 | )
183 |
184 | X_train_processed = np.hstack(
185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
186 | )
187 | X_test_processed = np.hstack(
188 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
189 | )
190 |
191 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
192 | y_train_estimation = model.predict(X_train_processed)
193 | y_test_estimation = model.predict(X_test_processed)
194 |
195 | cm_train = confusion_matrix(y_train, y_train_estimation)
196 |
197 | cm_test = confusion_matrix(y_test, y_test_estimation)
198 |
199 | print('cm_train', cm_train)
200 | print('cm_test', cm_test)
201 |
202 | do_test('../data/cm_test.pkl', cm_test)
203 | do_test('../data/cm_train.pkl', cm_train)
204 | do_test('../data/X_train_processed.pkl', X_train_processed)
205 | do_test('../data/X_test_processed.pkl', X_test_processed)
206 |
207 | do_pandas_test('../data/df.pkl', df)
208 |
209 |
210 | def main(param: str = 'pass'):
211 | titanic_model_creator = TitanicModelCreator(
212 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db')
213 | )
214 | titanic_model_creator.run()
215 |
216 |
217 | def test_main(param: str = 'pass'):
218 | titanic_model_creator = TitanicModelCreator(
219 | loader=TestLoader(
220 | passengers_filename='../data/passengers_with_is_survived.pkl',
221 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
222 | )
223 | )
224 | titanic_model_creator.run()
225 |
226 |
227 | if __name__ == "__main__":
228 | typer.run(test_main)
229 |
--------------------------------------------------------------------------------
/Step13/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step14/README13.md:
--------------------------------------------------------------------------------
1 | ### Step 13: Use Passenger objects in the program
2 |
3 | - Add `PassengerLoader` to `main` and `test_main`
4 | - Add the `RARE_TITLES` constant
5 | - Convert the classes back into the `df` dataframe with `passenger.dict()`
6 |
7 | It is very important to refactor incrementally. Any change should be small enough that, if the tests fail, the source of the failure can be found quickly. So for now we stop at using the new loader and do not change anything else.
8 |
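9 | A condensed sketch of the wiring (names match this step's file below); the typed `Passenger` objects are flattened straight back into the dataframe, so the rest of `run()` stays untouched for this increment:
10 |
11 | ```python
12 | # In main(): wrap the existing loader, then in run() convert back to a dataframe.
13 | loader = PassengerLoader(
14 |     loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
15 |     rare_titles=RARE_TITLES,
16 | )
17 | df = pd.DataFrame([passenger.dict() for passenger in loader.get_passengers()])
18 | ```
19 |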
--------------------------------------------------------------------------------
/Step14/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step14/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step14/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModelCreator:
139 | def __init__(self, loader):
140 | self.loader = loader
141 | np.random.seed(42)
142 |
143 | def run(self):
144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
145 | targets = [int(v) for v in df['is_survived']]
146 |
147 | X_train, X_test, y_train, y_test = train_test_split(
148 | df, targets, stratify=targets, test_size=0.2
149 | )
150 |
151 | X_train_categorical = X_train[
152 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
153 | ]
154 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
155 |
156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
157 | X_train_categorical
158 | )
159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
160 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
161 |
162 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
163 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
164 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
165 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
166 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
167 |
168 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
169 | X_train_numerical_imputed_scaled = robust_scaler.transform(
170 | X_train_numerical_imputed
171 | )
172 | X_test_numerical_imputed_scaled = robust_scaler.transform(
173 | X_test_numerical_imputed
174 | )
175 |
176 | X_train_processed = np.hstack(
177 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
178 | )
179 | X_test_processed = np.hstack(
180 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
181 | )
182 |
183 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
184 | y_train_estimation = model.predict(X_train_processed)
185 | y_test_estimation = model.predict(X_test_processed)
186 |
187 | cm_train = confusion_matrix(y_train, y_train_estimation)
188 |
189 | cm_test = confusion_matrix(y_test, y_test_estimation)
190 |
191 | print('cm_train', cm_train)
192 | print('cm_test', cm_test)
193 |
194 | do_test('../data/cm_test.pkl', cm_test)
195 | do_test('../data/cm_train.pkl', cm_train)
196 | do_test('../data/X_train_processed.pkl', X_train_processed)
197 | do_test('../data/X_test_processed.pkl', X_test_processed)
198 |
199 | do_pandas_test('../data/df_no_tickets.pkl', df)
200 |
201 |
202 | def main(param: str = 'pass'):
203 | titanic_model_creator = TitanicModelCreator(
204 | loader=PassengerLoader(
205 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
206 | rare_titles=RARE_TITLES,
207 | )
208 | )
209 | titanic_model_creator.run()
210 |
211 |
212 | def test_main(param: str = 'pass'):
213 | titanic_model_creator = TitanicModelCreator(
214 | loader=PassengerLoader(
215 | loader=TestLoader(
216 | passengers_filename='../data/passengers_with_is_survived.pkl',
217 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
218 | ),
219 | rare_titles=RARE_TITLES,
220 | )
221 | )
222 | titanic_model_creator.run()
223 |
224 |
225 | if __name__ == "__main__":
226 | typer.run(test_main)
227 |
--------------------------------------------------------------------------------
/Step14/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step15/README14.md:
--------------------------------------------------------------------------------
1 | ### Step 14: Separate training and evaluation functions
2 |
3 | - Move all code related to evaluation (variables that have `_test_` in their names) into one group
4 |
5 | After the model is created, it is first trained, then evaluated on the training data, and finally evaluated on the testing data. These stages should be separated into their own logical blocks. This prepares for moving them into actually separate places.
6 |
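7 | A minimal, self-contained illustration of the invariant behind this grouping (toy data, not from the repo): every `fit` call belongs in the training block, while the testing block only ever calls `transform` on the already-fitted objects:
8 |
9 | ```python
10 | import numpy as np
11 | from sklearn.preprocessing import RobustScaler
12 |
13 | X_train = np.array([[1.0], [2.0], [3.0], [100.0]])
14 | X_test = np.array([[2.5]])
15 |
16 | # --- TRAINING --- fitting happens here, and only here
17 | robust_scaler = RobustScaler().fit(X_train)
18 | X_train_scaled = robust_scaler.transform(X_train)
19 |
20 | # --- TESTING --- reuse the fitted scaler, never refit on evaluation data
21 | X_test_scaled = robust_scaler.transform(X_test)
22 | ```
23 |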
--------------------------------------------------------------------------------
/Step15/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step15/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step15/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModelCreator:
139 | def __init__(self, loader):
140 | self.loader = loader
141 | np.random.seed(42)
142 |
143 | def run(self):
144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
145 | targets = [int(v) for v in df['is_survived']]
146 |
147 | X_train, X_test, y_train, y_test = train_test_split(
148 | df, targets, stratify=targets, test_size=0.2
149 | )
150 |
151 | # --- TRAINING ---
152 | X_train_categorical = X_train[
153 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
154 | ]
155 |
156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(
157 | X_train_categorical
158 | )
159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)
160 |
161 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
162 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)
163 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)
164 |
165 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed)
166 | X_train_numerical_imputed_scaled = robust_scaler.transform(
167 | X_train_numerical_imputed
168 | )
169 |
170 | X_train_processed = np.hstack(
171 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
172 | )
173 |
174 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)
175 | y_train_estimation = model.predict(X_train_processed)
176 |
177 | cm_train = confusion_matrix(y_train, y_train_estimation)
178 |
179 | # --- TESTING ---
180 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
181 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)
182 |
183 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
184 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)
185 | X_test_numerical_imputed_scaled = robust_scaler.transform(
186 | X_test_numerical_imputed
187 | )
188 |
189 | X_test_processed = np.hstack(
190 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
191 | )
192 |
193 | y_test_estimation = model.predict(X_test_processed)
194 | cm_test = confusion_matrix(y_test, y_test_estimation)
195 |
196 | print('cm_train', cm_train)
197 | print('cm_test', cm_test)
198 |
199 | do_test('../data/cm_test.pkl', cm_test)
200 | do_test('../data/cm_train.pkl', cm_train)
201 | do_test('../data/X_train_processed.pkl', X_train_processed)
202 | do_test('../data/X_test_processed.pkl', X_test_processed)
203 |
204 | do_pandas_test('../data/df_no_tickets.pkl', df)
205 |
206 |
207 | def main(param: str = 'pass'):
208 | titanic_model_creator = TitanicModelCreator(
209 | loader=PassengerLoader(
210 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
211 | rare_titles=RARE_TITLES,
212 | )
213 | )
214 | titanic_model_creator.run()
215 |
216 |
217 | def test_main(param: str = 'pass'):
218 | titanic_model_creator = TitanicModelCreator(
219 | loader=PassengerLoader(
220 | loader=TestLoader(
221 | passengers_filename='../data/passengers_with_is_survived.pkl',
222 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
223 | ),
224 | rare_titles=RARE_TITLES,
225 | )
226 | )
227 | titanic_model_creator.run()
228 |
229 |
230 | if __name__ == "__main__":
231 | typer.run(test_main)
232 |
--------------------------------------------------------------------------------
/Step15/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step16/README15.md:
--------------------------------------------------------------------------------
1 | ### Step 15: Create `TitanicModel` class
2 |
3 | - Create a class that has all the `sklearn` components as member variables
4 | - Instantiate these before the "Training" block
5 | - Use these instead of the local ones
6 |
7 | The goal of the whole program is to create a model; despite this, until now there was no single object representing that model. The next steps establish the concept of this model and the services it provides to `TitanicModelCreator`.
8 |
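9 | The new class, excerpted from the file below; the `sklearn` components become member variables, while `train()` and `estimate()` are stubs to be filled in over the next steps:
10 |
11 | ```python
12 | class TitanicModel:
13 |     def __init__(self):
14 |         self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
15 |         self.knn_imputer = KNNImputer(n_neighbors=5)
16 |         self.robust_scaler = RobustScaler()
17 |         self.predictor = LogisticRegression(random_state=0)
18 | ```
19 |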
--------------------------------------------------------------------------------
/Step16/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step16/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step16/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
141 | self.knn_imputer = KNNImputer(n_neighbors=5)
142 | self.robust_scaler = RobustScaler()
143 | self.predictor = LogisticRegression(random_state=0)
144 |
145 | def train(self):
146 | pass
147 |
148 | def estimate(self, passengers):
149 | return 1
150 |
151 |
152 | class TitanicModelCreator:
153 | def __init__(self, loader):
154 | self.loader = loader
155 | np.random.seed(42)
156 |
157 | def run(self):
158 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()])
159 | targets = [int(v) for v in df['is_survived']]
160 |
161 | X_train, X_test, y_train, y_test = train_test_split(
162 | df, targets, stratify=targets, test_size=0.2
163 | )
164 |
165 | # --- TRAINING ---
166 | model = TitanicModel()
167 |
168 | X_train_categorical = X_train[
169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
170 | ]
171 |
172 | model.one_hot_encoder.fit(X_train_categorical)
173 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical)
174 |
175 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
176 | model.knn_imputer.fit(X_train_numerical)
177 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical)
178 |
179 | model.robust_scaler.fit(X_train_numerical_imputed)
180 | X_train_numerical_imputed_scaled = model.robust_scaler.transform(
181 | X_train_numerical_imputed
182 | )
183 |
184 | X_train_processed = np.hstack(
185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
186 | )
187 |
188 | model.predictor.fit(X_train_processed, y_train)
189 | y_train_estimation = model.predictor.predict(X_train_processed)
190 |
191 | cm_train = confusion_matrix(y_train, y_train_estimation)
192 |
193 | # --- TESTING ---
194 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
195 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
196 |
197 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
198 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
199 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
200 | X_test_numerical_imputed
201 | )
202 |
203 | X_test_processed = np.hstack(
204 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
205 | )
206 |
207 | y_test_estimation = model.predictor.predict(X_test_processed)
208 | cm_test = confusion_matrix(y_test, y_test_estimation)
209 |
210 | print('cm_train', cm_train)
211 | print('cm_test', cm_test)
212 |
213 | do_test('../data/cm_test.pkl', cm_test)
214 | do_test('../data/cm_train.pkl', cm_train)
215 | do_test('../data/X_train_processed.pkl', X_train_processed)
216 | do_test('../data/X_test_processed.pkl', X_test_processed)
217 |
218 | do_pandas_test('../data/df_no_tickets.pkl', df)
219 |
220 |
221 | def main(param: str = 'pass'):
222 | titanic_model_creator = TitanicModelCreator(
223 | loader=PassengerLoader(
224 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
225 | rare_titles=RARE_TITLES,
226 | )
227 | )
228 | titanic_model_creator.run()
229 |
230 |
231 | def test_main(param: str = 'pass'):
232 | titanic_model_creator = TitanicModelCreator(
233 | loader=PassengerLoader(
234 | loader=TestLoader(
235 | passengers_filename='../data/passengers_with_is_survived.pkl',
236 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
237 | ),
238 | rare_titles=RARE_TITLES,
239 | )
240 | )
241 | titanic_model_creator.run()
242 |
243 |
244 | if __name__ == "__main__":
245 | typer.run(test_main)
246 |
--------------------------------------------------------------------------------
/Step16/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step17/README16.md:
--------------------------------------------------------------------------------
1 | ### Step 16: Passenger class based training and evaluation sets
2 |
3 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely whether the passenger survived or not)
4 | - Refactor `X_train/X_test` to be created from these lists of passengers
5 |
6 | Because `train_test_split` works on lists, we extract the pids and the targets from the `Passenger` objects, split the pids, and recreate the two sets through a mapping from pids back to the objects, as sketched below.
7 |
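8 | The split, condensed from the `split_passengers` method in the file below:
9 |
10 | ```python
11 | def split_passengers(self, passengers):
12 |     passengers_map = {p.pid: p for p in passengers}
13 |     pids = [passenger.pid for passenger in passengers]
14 |     targets = [passenger.is_survived for passenger in passengers]
15 |     # split the plain pid list, stratified on survival, then map back to objects
16 |     train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
17 |     return (
18 |         [passengers_map[pid] for pid in train_pids],
19 |         [passengers_map[pid] for pid in test_pids],
20 |     )
21 | ```
22 |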
--------------------------------------------------------------------------------
/Step17/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step17/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step17/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
141 | self.knn_imputer = KNNImputer(n_neighbors=5)
142 | self.robust_scaler = RobustScaler()
143 | self.predictor = LogisticRegression(random_state=0)
144 |
145 | def train(self):
146 | pass
147 |
148 | def estimate(self, passengers):
149 | return 1
150 |
151 |
152 | class TitanicModelCreator:
153 | def __init__(self, loader):
154 | self.loader = loader
155 | np.random.seed(42)
156 |
157 | def split_passengers(self, passengers):
158 | passengers_map = {p.pid: p for p in passengers}
159 | pids = [passenger.pid for passenger in passengers]
160 | targets = [passenger.is_survived for passenger in passengers]
161 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
162 | train_passengers = [passengers_map[pid] for pid in train_pids]
163 | test_passengers = [passengers_map[pid] for pid in test_pids]
164 | return train_passengers, test_passengers
165 |
166 | def run(self):
167 | passengers = self.loader.get_passengers()
168 | train_passengers, test_passengers = self.split_passengers(passengers)
169 |
170 | X_train = pd.DataFrame([v.dict() for v in train_passengers])
171 | y_train = [v.is_survived for v in train_passengers]
172 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
173 | y_test = [v.is_survived for v in test_passengers]
174 |
175 | # --- TRAINING ---
176 | model = TitanicModel()
177 |
178 | X_train_categorical = X_train[
179 | ['embarked', 'sex', 'pclass', 'title', 'is_alone']
180 | ]
181 |
182 | model.one_hot_encoder.fit(X_train_categorical)
183 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical)
184 |
185 | X_train_numerical = X_train[['age', 'fare', 'family_size']]
186 | model.knn_imputer.fit(X_train_numerical)
187 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical)
188 |
189 | model.robust_scaler.fit(X_train_numerical_imputed)
190 | X_train_numerical_imputed_scaled = model.robust_scaler.transform(
191 | X_train_numerical_imputed
192 | )
193 |
194 | X_train_processed = np.hstack(
195 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled)
196 | )
197 |
198 | model.predictor.fit(X_train_processed, y_train)
199 | y_train_estimation = model.predictor.predict(X_train_processed)
200 |
201 | cm_train = confusion_matrix(y_train, y_train_estimation)
202 |
203 | # --- TESTING ---
204 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
205 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
206 |
207 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
208 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
209 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
210 | X_test_numerical_imputed
211 | )
212 |
213 | X_test_processed = np.hstack(
214 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
215 | )
216 |
217 | y_test_estimation = model.predictor.predict(X_test_processed)
218 | cm_test = confusion_matrix(y_test, y_test_estimation)
219 |
220 | print('cm_train', cm_train)
221 | print('cm_test', cm_test)
222 |
223 | do_test('../data/cm_test.pkl', cm_test)
224 | do_test('../data/cm_train.pkl', cm_train)
225 | do_test('../data/X_train_processed.pkl', X_train_processed)
226 | do_test('../data/X_test_processed.pkl', X_test_processed)
227 |
228 | do_pandas_test(
229 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
230 | )
231 |
232 |
233 | def main(param: str = 'pass'):
234 | titanic_model_creator = TitanicModelCreator(
235 | loader=PassengerLoader(
236 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
237 | rare_titles=RARE_TITLES,
238 | )
239 | )
240 | titanic_model_creator.run()
241 |
242 |
243 | def test_main(param: str = 'pass'):
244 | titanic_model_creator = TitanicModelCreator(
245 | loader=PassengerLoader(
246 | loader=TestLoader(
247 | passengers_filename='../data/passengers_with_is_survived.pkl',
248 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
249 | ),
250 | rare_titles=RARE_TITLES,
251 | )
252 | )
253 | titanic_model_creator.run()
254 |
255 |
256 | if __name__ == "__main__":
257 | typer.run(test_main)
258 |
--------------------------------------------------------------------------------
/Step17/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step18/README17.md:
--------------------------------------------------------------------------------
1 | ### Step 17: Create input processing for `TitanicModel`
2 |
3 | - Move code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`.
4 | - Introduce `self.trained` boolean
5 | - Based on `self.trained` either call the `transform` or `fit_transform` of the `sklearn` input processor functions
6 |
7 | All the input transformation code happens twice: once for the training data and once for the evaluation data. Yet transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies functionality that belongs in `TitanicModel`. It will take several steps to resolve this; the resulting code will be a self-contained model that can be shipped independently of its creator.
8 |
9 |
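10 | The resulting method, excerpted from the file below; `self.trained` selects between the fit-and-transform path used on the first (training) call and the transform-only path used afterwards:
11 |
12 | ```python
13 | def process_inputs(self, passengers):
14 |     data = pd.DataFrame([v.dict() for v in passengers])
15 |     categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
16 |     numerical_data = data[['age', 'fare', 'family_size']]
17 |     if self.trained:
18 |         # inference: reuse the already-fitted transformers
19 |         categorical_data = self.one_hot_encoder.transform(categorical_data)
20 |         numerical_data = self.robust_scaler.transform(
21 |             self.knn_imputer.transform(numerical_data)
22 |         )
23 |     else:
24 |         # first call, during training: fit and transform in one go
25 |         categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
26 |         numerical_data = self.robust_scaler.fit_transform(
27 |             self.knn_imputer.fit_transform(numerical_data)
28 |         )
29 |     return np.hstack((categorical_data, numerical_data))
30 | ```
31 |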
--------------------------------------------------------------------------------
/Step18/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step18/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step18/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self):
163 | pass
164 |
165 | def estimate(self, passengers):
166 | return 1
167 |
168 |
169 | class TitanicModelCreator:
170 | def __init__(self, loader):
171 | self.loader = loader
172 | np.random.seed(42)
173 |
174 | def split_passengers(self, passengers):
175 | passengers_map = {p.pid: p for p in passengers}
176 | pids = [passenger.pid for passenger in passengers]
177 | targets = [passenger.is_survived for passenger in passengers]
178 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
179 | train_passengers = [passengers_map[pid] for pid in train_pids]
180 | test_passengers = [passengers_map[pid] for pid in test_pids]
181 | return train_passengers, test_passengers
182 |
183 | def run(self):
184 | passengers = self.loader.get_passengers()
185 | train_passengers, test_passengers = self.split_passengers(passengers)
186 |
187 | y_train = [v.is_survived for v in train_passengers]
188 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
189 | y_test = [v.is_survived for v in test_passengers]
190 |
191 | # --- TRAINING ---
192 | model = TitanicModel()
193 |
194 | X_train_processed = model.process_inputs(train_passengers)
195 | model.predictor.fit(X_train_processed, y_train)
196 | y_train_estimation = model.predictor.predict(X_train_processed)
197 |
198 | cm_train = confusion_matrix(y_train, y_train_estimation)
199 |
200 | # --- TESTING ---
201 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
202 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
203 |
204 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
205 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
206 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
207 | X_test_numerical_imputed
208 | )
209 |
210 | X_test_processed = np.hstack(
211 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
212 | )
213 |
214 | y_test_estimation = model.predictor.predict(X_test_processed)
215 | cm_test = confusion_matrix(y_test, y_test_estimation)
216 |
217 | print('cm_train', cm_train)
218 | print('cm_test', cm_test)
219 |
220 | do_test('../data/cm_test.pkl', cm_test)
221 | do_test('../data/cm_train.pkl', cm_train)
222 | do_test('../data/X_train_processed.pkl', X_train_processed)
223 | do_test('../data/X_test_processed.pkl', X_test_processed)
224 |
225 | do_pandas_test(
226 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
227 | )
228 |
229 |
230 | def main(param: str = 'pass'):
231 | titanic_model_creator = TitanicModelCreator(
232 | loader=PassengerLoader(
233 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
234 | rare_titles=RARE_TITLES,
235 | )
236 | )
237 | titanic_model_creator.run()
238 |
239 |
240 | def test_main(param: str = 'pass'):
241 | titanic_model_creator = TitanicModelCreator(
242 | loader=PassengerLoader(
243 | loader=TestLoader(
244 | passengers_filename='../data/passengers_with_is_survived.pkl',
245 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
246 | ),
247 | rare_titles=RARE_TITLES,
248 | )
249 | )
250 | titanic_model_creator.run()
251 |
252 |
253 | if __name__ == "__main__":
254 | typer.run(test_main)
255 |
--------------------------------------------------------------------------------
/Step18/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py "$@"
2 |
--------------------------------------------------------------------------------
/Step19/README18.md:
--------------------------------------------------------------------------------
1 | ### Step 18: Move training into `TitanicModel`
2 |
3 | - Give `train()` the same interface as `process_inputs`
4 | - Process the data with `process_inputs` (just pass the arguments through)
5 | - Recreate the required targets from the passenger objects
6 | - Train the model and set the `trained` boolean to `True` (see the sketch below)
7 |
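8 | The whole step, excerpted from the file below:
9 |
10 | ```python
11 | def train(self, passengers):
12 |     targets = [v.is_survived for v in passengers]
13 |     inputs = self.process_inputs(passengers)  # first call: fits the transformers
14 |     self.predictor.fit(inputs, targets)
15 |     self.trained = True
16 | ```
17 |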
--------------------------------------------------------------------------------
/Step19/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step19/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step19/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self, passengers):
163 | targets = [v.is_survived for v in passengers]
164 | inputs = self.process_inputs(passengers)
165 | self.predictor.fit(inputs, targets)
166 | self.trained = True
167 |
168 | def estimate(self, passengers):
169 | return 1  # placeholder; implemented in the next step
170 |
171 |
172 | class TitanicModelCreator:
173 | def __init__(self, loader):
174 | self.loader = loader
175 | np.random.seed(42)
176 |
177 | def split_passengers(self, passengers):
178 | passengers_map = {p.pid: p for p in passengers}
179 | pids = [passenger.pid for passenger in passengers]
180 | targets = [passenger.is_survived for passenger in passengers]
181 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
182 | train_passengers = [passengers_map[pid] for pid in train_pids]
183 | test_passengers = [passengers_map[pid] for pid in test_pids]
184 | return train_passengers, test_passengers
185 |
186 | def run(self):
187 | passengers = self.loader.get_passengers()
188 | train_passengers, test_passengers = self.split_passengers(passengers)
189 |
190 | y_train = [v.is_survived for v in train_passengers]
191 | X_test = pd.DataFrame([v.dict() for v in test_passengers])
192 | y_test = [v.is_survived for v in test_passengers]
193 |
194 | # --- TRAINING ---
195 | model = TitanicModel()
196 | model.train(train_passengers)
197 |
198 | X_train_processed = model.process_inputs(train_passengers)
199 | y_train_estimation = model.predictor.predict(X_train_processed)
200 | cm_train = confusion_matrix(y_train, y_train_estimation)
201 |
202 | # --- TESTING ---
203 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
204 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical)
205 |
206 | X_test_numerical = X_test[['age', 'fare', 'family_size']]
207 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical)
208 | X_test_numerical_imputed_scaled = model.robust_scaler.transform(
209 | X_test_numerical_imputed
210 | )
211 |
212 | X_test_processed = np.hstack(
213 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled)
214 | )
215 |
216 | y_test_estimation = model.predictor.predict(X_test_processed)
217 | cm_test = confusion_matrix(y_test, y_test_estimation)
218 |
219 | print('cm_train', cm_train)
220 | print('cm_test', cm_test)
221 |
222 | do_test('../data/cm_test.pkl', cm_test)
223 | do_test('../data/cm_train.pkl', cm_train)
224 | do_test('../data/X_train_processed.pkl', X_train_processed)
225 | do_test('../data/X_test_processed.pkl', X_test_processed)
226 |
227 | do_pandas_test(
228 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
229 | )
230 |
231 |
232 | def main(param: str = 'pass'):
233 | titanic_model_creator = TitanicModelCreator(
234 | loader=PassengerLoader(
235 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
236 | rare_titles=RARE_TITLES,
237 | )
238 | )
239 | titanic_model_creator.run()
240 |
241 |
242 | def test_main(param: str = 'pass'):
243 | titanic_model_creator = TitanicModelCreator(
244 | loader=PassengerLoader(
245 | loader=TestLoader(
246 | passengers_filename='../data/passengers_with_is_survived.pkl',
247 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
248 | ),
249 | rare_titles=RARE_TITLES,
250 | )
251 | )
252 | titanic_model_creator.run()
253 |
254 |
255 | if __name__ == "__main__":
256 | typer.run(test_main)
257 |
--------------------------------------------------------------------------------
/Step19/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step20/README19.md:
--------------------------------------------------------------------------------
1 | ### Step 19: Move prediction to `TitanicModel`
2 |
3 | - Create the `estimate` function
4 | - Call `process_inputs` and `predictor.predict` in it
5 | - Remove all evaluation input processing code
6 | - Call `estimate` from `run`
7 |
8 | Because there was no separation of concerns, the input processing code was duplicated; now that it has been moved to its own location, the duplicates can be removed.
9 |
10 | `X_train_processed` and `X_test_processed` no longer exist, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and to find a different way to test behaviour. To keep the project short we set this aside, but it would be a good place to introduce more tests (see the sketch below).
11 |
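As a sketch of what such a behaviour-level test could look like, the helper below (hypothetical, not part of the repo) asserts properties of `estimate`'s output instead of comparing recreated intermediate arrays:

```python
# Hypothetical behaviour-level check: no pickled ground-truth arrays needed,
# we assert structural properties of the estimates directly.
def check_estimate_behaviour(model, passengers):
    estimations = model.estimate(passengers)
    assert len(estimations) == len(passengers)  # one estimate per passenger
    assert set(estimations).issubset({0, 1})  # binary survival labels
    print('estimate behaviour test passed')
```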
--------------------------------------------------------------------------------
/Step20/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step20/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step20/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class PassengerLoader:
111 | def __init__(self, loader, rare_titles=None):
112 | self.loader = loader
113 | self.rare_titles = rare_titles
114 |
115 | def get_passengers(self):
116 | passengers = []
117 | for data in self.loader.get_passengers().itertuples():
118 | # parch = Parents/Children, sibsp = Siblings/Spouses
119 | family_size = int(data.parch + data.sibsp)
120 | # Allen, Miss. Elisabeth Walton
121 | title = data.name.split(',')[1].split('.')[0].strip()
122 | passenger = Passenger(
123 | pid=int(data.pid),
124 | pclass=int(data.pclass),
125 | sex=str(data.sex),
126 | age=float(data.age),
127 | family_size=family_size,
128 | fare=float(data.fare),
129 | embarked=str(data.embarked),
130 | is_alone=1 if family_size == 1 else 0,
131 | title='rare' if title in self.rare_titles else title,
132 | is_survived=int(data.is_survived),
133 | )
134 | passengers.append(passenger)
135 | return passengers
136 |
137 |
138 | class TitanicModel:
139 | def __init__(self):
140 | self.trained = False
141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
142 | self.knn_imputer = KNNImputer(n_neighbors=5)
143 | self.robust_scaler = RobustScaler()
144 | self.predictor = LogisticRegression(random_state=0)
145 |
146 | def process_inputs(self, passengers):
147 | data = pd.DataFrame([v.dict() for v in passengers])
148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
149 | numerical_data = data[['age', 'fare', 'family_size']]
150 | if self.trained:
151 | categorical_data = self.one_hot_encoder.transform(categorical_data)
152 | numerical_data = self.robust_scaler.transform(
153 | self.knn_imputer.transform(numerical_data)
154 | )
155 | else:
156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
157 | numerical_data = self.robust_scaler.fit_transform(
158 | self.knn_imputer.fit_transform(numerical_data)
159 | )
160 | return np.hstack((categorical_data, numerical_data))
161 |
162 | def train(self, passengers):
163 | targets = [v.is_survived for v in passengers]
164 | inputs = self.process_inputs(passengers)
165 | self.predictor.fit(inputs, targets)
166 | self.trained = True
167 |
168 | def estimate(self, passengers):
169 | inputs = self.process_inputs(passengers)
170 | return self.predictor.predict(inputs)
171 |
172 |
173 | class TitanicModelCreator:
174 | def __init__(self, loader):
175 | self.loader = loader
176 | np.random.seed(42)
177 |
178 | def split_passengers(self, passengers):
179 | passengers_map = {p.pid: p for p in passengers}
180 | pids = [passenger.pid for passenger in passengers]
181 | targets = [passenger.is_survived for passenger in passengers]
182 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
183 | train_passengers = [passengers_map[pid] for pid in train_pids]
184 | test_passengers = [passengers_map[pid] for pid in test_pids]
185 | return train_passengers, test_passengers
186 |
187 | def run(self):
188 | passengers = self.loader.get_passengers()
189 | train_passengers, test_passengers = self.split_passengers(passengers)
190 |
191 | # --- TRAINING ---
192 | model = TitanicModel()
193 | model.train(train_passengers)
194 | y_train_estimation = model.estimate(train_passengers)
195 | cm_train = confusion_matrix(
196 | [v.is_survived for v in train_passengers], y_train_estimation
197 | )
198 |
199 | # --- TESTING ---
200 | y_test_estimation = model.estimate(test_passengers)
201 | cm_test = confusion_matrix(
202 | [v.is_survived for v in test_passengers], y_test_estimation
203 | )
204 |
205 | print('cm_train', cm_train)
206 | print('cm_test', cm_test)
207 |
208 | do_test('../data/cm_test.pkl', cm_test)
209 | do_test('../data/cm_train.pkl', cm_train)
210 | X_train_processed = model.process_inputs(train_passengers)
211 | do_test('../data/X_train_processed.pkl', X_train_processed)
212 | X_test_processed = model.process_inputs(test_passengers)
213 | do_test('../data/X_test_processed.pkl', X_test_processed)
214 |
215 | do_pandas_test(
216 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers])
217 | )
218 |
219 |
220 | def main(param: str = 'pass'):
221 | titanic_model_creator = TitanicModelCreator(
222 | loader=PassengerLoader(
223 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
224 | rare_titles=RARE_TITLES,
225 | )
226 | )
227 | titanic_model_creator.run()
228 |
229 |
230 | def test_main(param: str = 'pass'):
231 | titanic_model_creator = TitanicModelCreator(
232 | loader=PassengerLoader(
233 | loader=TestLoader(
234 | passengers_filename='../data/passengers_with_is_survived.pkl',
235 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
236 | ),
237 | rare_titles=RARE_TITLES,
238 | )
239 | )
240 | titanic_model_creator.run()
241 |
242 |
243 | if __name__ == "__main__":
244 | typer.run(test_main)
245 |
--------------------------------------------------------------------------------
/Step20/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step21/README20.md:
--------------------------------------------------------------------------------
1 | ### Step 20: Save model and move tests to custom model savers
2 |
3 | - Create `ModelSaver` that has a `save_model` interface that accepts a model and a result object
4 | - Pickle the model and the result to a file
5 | - Create `TestModelSaver` that has the same interface
6 | - Move the testing code to the `save_model` function
7 | - Add a `model_saver` property to `TitanicModelCreator` and call it after the evaluation code
8 | - Pass an instance of `ModelSaver` in `main` and of `TestModelSaver` in `test_main` to the construction of `TitanicModelCreator`
9 |
10 | Currently `TitanicModelCreator` contains its own testing code, even though it is intended to run in production. It also has no way to save the model. We introduce the concept of a `ModelSaver` here: anything that needs to be preserved after model training needs to be passed to this class.
11 |
12 | We will also move testing into a dedicated `TestModelSaver` that, instead of saving the model, runs the tests that would otherwise live in `run()`. This way the same code can run in production and in testing without change.
13 |
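As a usage note, here is a minimal sketch (assuming the filenames passed to `ModelSaver` in `main()` below) of loading the pickled model back for later inference:

```python
import pickle

# Load the model and result saved by ModelSaver (paths as used in main()).
with open('../data/real_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('../data/real_result.pkl', 'rb') as f:
    result = pickle.load(f)

# The restored TitanicModel can estimate passengers directly.
predictions = model.estimate(result['test_passengers'])
```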
--------------------------------------------------------------------------------
/Step21/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step21/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step21/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class ModelSaver:
111 | def __init__(self, model_filename, result_filename):
112 | self.model_filename = model_filename
113 | self.result_filename = result_filename
114 |
115 | def save_model(self, model, result):
116 | pickle.dump(model, open(self.model_filename, 'wb'))
117 | pickle.dump(result, open(self.result_filename, 'wb'))
118 |
119 |
120 | class TestModelSaver:
121 | def __init__(self):
122 | pass
123 |
124 | def save_model(self, model, result):
125 | do_test('../data/cm_test.pkl', result['cm_test'])
126 | do_test('../data/cm_train.pkl', result['cm_train'])
127 | X_train_processed = model.process_inputs(result['train_passengers'])
128 | do_test('../data/X_train_processed.pkl', X_train_processed)
129 | X_test_processed = model.process_inputs(result['test_passengers'])
130 | do_test('../data/X_test_processed.pkl', X_test_processed)
131 |
132 |
133 | class PassengerLoader:
134 | def __init__(self, loader, rare_titles=None):
135 | self.loader = loader
136 | self.rare_titles = rare_titles
137 |
138 | def get_passengers(self):
139 | passengers = []
140 | for data in self.loader.get_passengers().itertuples():
141 | # parch = Parents/Children, sibsp = Siblings/Spouses
142 | family_size = int(data.parch + data.sibsp)
143 | # Allen, Miss. Elisabeth Walton
144 | title = data.name.split(',')[1].split('.')[0].strip()
145 | passenger = Passenger(
146 | pid=int(data.pid),
147 | pclass=int(data.pclass),
148 | sex=str(data.sex),
149 | age=float(data.age),
150 | family_size=family_size,
151 | fare=float(data.fare),
152 | embarked=str(data.embarked),
153 | is_alone=1 if family_size == 1 else 0,
154 | title='rare' if title in self.rare_titles else title,
155 | is_survived=int(data.is_survived),
156 | )
157 | passengers.append(passenger)
158 | return passengers
159 |
160 |
161 | class TitanicModel:
162 | def __init__(self):
163 | self.trained = False
164 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
165 | self.knn_imputer = KNNImputer(n_neighbors=5)
166 | self.robust_scaler = RobustScaler()
167 | self.predictor = LogisticRegression(random_state=0)
168 |
169 | def process_inputs(self, passengers):
170 | data = pd.DataFrame([v.dict() for v in passengers])
171 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
172 | numerical_data = data[['age', 'fare', 'family_size']]
173 | if self.trained:
174 | categorical_data = self.one_hot_encoder.transform(categorical_data)
175 | numerical_data = self.robust_scaler.transform(
176 | self.knn_imputer.transform(numerical_data)
177 | )
178 | else:
179 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
180 | numerical_data = self.robust_scaler.fit_transform(
181 | self.knn_imputer.fit_transform(numerical_data)
182 | )
183 | return np.hstack((categorical_data, numerical_data))
184 |
185 | def train(self, passengers):
186 | targets = [v.is_survived for v in passengers]
187 | inputs = self.process_inputs(passengers)
188 | self.predictor.fit(inputs, targets)
189 | self.trained = True
190 |
191 | def estimate(self, passengers):
192 | inputs = self.process_inputs(passengers)
193 | return self.predictor.predict(inputs)
194 |
195 |
196 | class TitanicModelCreator:
197 | def __init__(self, loader, model_saver):
198 | self.loader = loader
199 | self.model_saver = model_saver
200 | np.random.seed(42)
201 |
202 | def split_passengers(self, passengers):
203 | passengers_map = {p.pid: p for p in passengers}
204 | pids = [passenger.pid for passenger in passengers]
205 | targets = [passenger.is_survived for passenger in passengers]
206 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
207 | train_passengers = [passengers_map[pid] for pid in train_pids]
208 | test_passengers = [passengers_map[pid] for pid in test_pids]
209 | return train_passengers, test_passengers
210 |
211 | def run(self):
212 | passengers = self.loader.get_passengers()
213 | train_passengers, test_passengers = self.split_passengers(passengers)
214 |
215 | # --- TRAINING ---
216 | model = TitanicModel()
217 | model.train(train_passengers)
218 | y_train_estimation = model.estimate(train_passengers)
219 | cm_train = confusion_matrix(
220 | [v.is_survived for v in train_passengers], y_train_estimation
221 | )
222 |
223 | # --- TESTING ---
224 | y_test_estimation = model.estimate(test_passengers)
225 | cm_test = confusion_matrix(
226 | [v.is_survived for v in test_passengers], y_test_estimation
227 | )
228 |
229 | self.model_saver.save_model(
230 | model=model,
231 | result={
232 | 'cm_train': cm_train,
233 | 'cm_test': cm_test,
234 | 'train_passengers': train_passengers,
235 | 'test_passengers': test_passengers,
236 | },
237 | )
238 |
239 |
240 | def main(param: str = 'pass'):
241 | titanic_model_creator = TitanicModelCreator(
242 | loader=PassengerLoader(
243 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
244 | rare_titles=RARE_TITLES,
245 | ),
246 | model_saver=ModelSaver(
247 | model_filename='../data/real_model.pkl',
248 | result_filename='../data/real_result.pkl',
249 | ),
250 | )
251 | titanic_model_creator.run()
252 |
253 |
254 | def test_main(param: str = 'pass'):
255 | titanic_model_creator = TitanicModelCreator(
256 | loader=PassengerLoader(
257 | loader=TestLoader(
258 | passengers_filename='../data/passengers_with_is_survived.pkl',
259 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
260 | ),
261 | rare_titles=RARE_TITLES,
262 | ),
263 | model_saver=TestModelSaver(),
264 | )
265 | titanic_model_creator.run()
266 |
267 |
268 | if __name__ == "__main__":
269 | typer.run(test_main)
270 |
--------------------------------------------------------------------------------
/Step21/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/Step22/README21.md:
--------------------------------------------------------------------------------
1 | ### Step 21: Enable training of different models
2 |
3 | - Add a `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance
4 | - Add `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main`
5 | - Expose parts of `TitanicModel` (the `predictor` and the `n_neighbors` processing parameter)
6 |
7 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment, with new loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows you to restructure your code on the fly and find out what the most convenient architecture for your system is. Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start implementation. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintaining code quality provide exactly this at very low effort.
8 |
9 | Next steps:
10 |
11 | - Use different data:
12 | - Update `SqlLoader` to retrieve different data
13 | - Update `Passenger` class to contain this new data
14 | - Update `PassengerLoader` class to process this new data into the classes
15 | - Update `process_inputs` to create features out of this new data
16 | - Use different features:
17 | - Update `process_inputs` in `TitanicModel`, expose parameters as needed
18 | - Use different model:
19 | - Use a different `predictor` in `TitanicModel` (see the sketch below)
20 |
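For example, a minimal sketch (assuming `TitanicModel` is in scope) of swapping in a different predictor; any estimator with `fit()`/`predict()` methods works:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical experiment context: same pipeline, different predictor.
model = TitanicModel(
    n_neighbors=5,
    predictor=RandomForestClassifier(n_estimators=100, random_state=0),
)
```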
--------------------------------------------------------------------------------
/Step22/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/Step22/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 |
--------------------------------------------------------------------------------
/Step22/titanic_model.py:
--------------------------------------------------------------------------------
1 | import os
2 | import pickle
3 | import typer
4 | import numpy as np
5 | import pandas as pd
6 | from pydantic import BaseModel
7 | from sqlalchemy import create_engine
8 |
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.linear_model import LogisticRegression
11 | from sklearn.preprocessing import RobustScaler
12 | from sklearn.preprocessing import OneHotEncoder
13 | from sklearn.impute import KNNImputer
14 | from sklearn.metrics import confusion_matrix
15 |
16 |
17 | RARE_TITLES = {
18 | 'Capt',
19 | 'Col',
20 | 'Don',
21 | 'Dona',
22 | 'Dr',
23 | 'Jonkheer',
24 | 'Lady',
25 | 'Major',
26 | 'Mlle',
27 | 'Mme',
28 | 'Ms',
29 | 'Rev',
30 | 'Sir',
31 | 'the Countess',
32 | }
33 |
34 |
35 | class Passenger(BaseModel):
36 | pid: int
37 | pclass: int
38 | sex: str
39 | age: float
40 | family_size: int
41 | fare: float
42 | embarked: str
43 | is_alone: int
44 | title: str
45 | is_survived: int
46 |
47 |
48 | def do_test(filename, data):
49 | if not os.path.isfile(filename):
50 | pickle.dump(data, open(filename, 'wb'))
51 | truth = pickle.load(open(filename, 'rb'))
52 | try:
53 | np.testing.assert_almost_equal(data, truth)
54 | print(f'{filename} test passed')
55 | except AssertionError as ex:
56 | print(f'{filename} test failed {ex}')
57 |
58 |
59 | def do_pandas_test(filename, data):
60 | if not os.path.isfile(filename):
61 | data.to_pickle(filename)
62 | truth = pd.read_pickle(filename)
63 | try:
64 | pd.testing.assert_frame_equal(data, truth)
65 | print(f'{filename} pandas test passed')
66 | except AssertionError as ex:
67 | print(f'{filename} pandas test failed {ex}')
68 |
69 |
70 | class SqlLoader:
71 | def __init__(self, connection_string):
72 | engine = create_engine(connection_string)
73 | self.connection = engine.connect()
74 |
75 | def get_passengers(self):
76 | query = """
77 | SELECT
78 | tbl_passengers.pid,
79 | tbl_passengers.pclass,
80 | tbl_passengers.sex,
81 | tbl_passengers.age,
82 | tbl_passengers.parch,
83 | tbl_passengers.sibsp,
84 | tbl_passengers.fare,
85 | tbl_passengers.embarked,
86 | tbl_passengers.name,
87 | tbl_targets.is_survived
88 | FROM
89 | tbl_passengers
90 | JOIN
91 | tbl_targets
92 | ON
93 | tbl_passengers.pid=tbl_targets.pid
94 | """
95 | return pd.read_sql(query, con=self.connection)
96 |
97 |
98 | class TestLoader:
99 | def __init__(self, passengers_filename, real_loader):
100 | self.passengers_filename = passengers_filename
101 | self.real_loader = real_loader
102 | if not os.path.isfile(self.passengers_filename):
103 | df = self.real_loader.get_passengers()
104 | df.to_pickle(self.passengers_filename)
105 |
106 | def get_passengers(self):
107 | return pd.read_pickle(self.passengers_filename)
108 |
109 |
110 | class ModelSaver:
111 | def __init__(self, model_filename, result_filename):
112 | self.model_filename = model_filename
113 | self.result_filename = result_filename
114 |
115 | def save_model(self, model, result):
116 | pickle.dump(model, open(self.model_filename, 'wb'))
117 | pickle.dump(result, open(self.result_filename, 'wb'))
118 |
119 |
120 | class TestModelSaver:
121 | def __init__(self):
122 | pass
123 |
124 | def save_model(self, model, result):
125 | do_test('../data/cm_test.pkl', result['cm_test'])
126 | do_test('../data/cm_train.pkl', result['cm_train'])
127 | X_train_processed = model.process_inputs(result['train_passengers'])
128 | do_test('../data/X_train_processed.pkl', X_train_processed)
129 | X_test_processed = model.process_inputs(result['test_passengers'])
130 | do_test('../data/X_test_processed.pkl', X_test_processed)
131 |
132 |
133 | class PassengerLoader:
134 | def __init__(self, loader, rare_titles=None):
135 | self.loader = loader
136 | self.rare_titles = rare_titles
137 |
138 | def get_passengers(self):
139 | passengers = []
140 | for data in self.loader.get_passengers().itertuples():
141 | # parch = Parents/Children, sibsp = Siblings/Spouses
142 | family_size = int(data.parch + data.sibsp)
143 | # Allen, Miss. Elisabeth Walton
144 | title = data.name.split(',')[1].split('.')[0].strip()
145 | passenger = Passenger(
146 | pid=int(data.pid),
147 | pclass=int(data.pclass),
148 | sex=str(data.sex),
149 | age=float(data.age),
150 | family_size=family_size,
151 | fare=float(data.fare),
152 | embarked=str(data.embarked),
153 | is_alone=1 if family_size == 1 else 0,
154 | title='rare' if title in self.rare_titles else title,
155 | is_survived=int(data.is_survived),
156 | )
157 | passengers.append(passenger)
158 | return passengers
159 |
160 |
161 | class TitanicModel:
162 | def __init__(self, n_neighbors=5, predictor=None):
163 | if predictor is None:
164 | predictor = LogisticRegression(random_state=0)
165 | self.trained = False
166 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
167 | self.knn_imputer = KNNImputer(n_neighbors=n_neighbors)
168 | self.robust_scaler = RobustScaler()
169 | self.predictor = predictor
170 |
171 | def process_inputs(self, passengers):
172 | data = pd.DataFrame([v.dict() for v in passengers])
173 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
174 | numerical_data = data[['age', 'fare', 'family_size']]
175 | if self.trained:
176 | categorical_data = self.one_hot_encoder.transform(categorical_data)
177 | numerical_data = self.robust_scaler.transform(
178 | self.knn_imputer.transform(numerical_data)
179 | )
180 | else:
181 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
182 | numerical_data = self.robust_scaler.fit_transform(
183 | self.knn_imputer.fit_transform(numerical_data)
184 | )
185 | return np.hstack((categorical_data, numerical_data))
186 |
187 | def train(self, passengers):
188 | targets = [v.is_survived for v in passengers]
189 | inputs = self.process_inputs(passengers)
190 | self.predictor.fit(inputs, targets)
191 | self.trained = True
192 |
193 | def estimate(self, passengers):
194 | inputs = self.process_inputs(passengers)
195 | return self.predictor.predict(inputs)
196 |
197 |
198 | class TitanicModelCreator:
199 | def __init__(self, loader, model, model_saver):
200 | self.loader = loader
201 | self.model = model
202 | self.model_saver = model_saver
203 | np.random.seed(42)
204 |
205 | def split_passengers(self, passengers):
206 | passengers_map = {p.pid: p for p in passengers}
207 | pids = [passenger.pid for passenger in passengers]
208 | targets = [passenger.is_survived for passenger in passengers]
209 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2)
210 | train_passengers = [passengers_map[pid] for pid in train_pids]
211 | test_passengers = [passengers_map[pid] for pid in test_pids]
212 | return train_passengers, test_passengers
213 |
214 | def run(self):
215 | passengers = self.loader.get_passengers()
216 | train_passengers, test_passengers = self.split_passengers(passengers)
217 |
218 | # --- TRAINING ---
219 | self.model.train(train_passengers)
220 | y_train_estimation = self.model.estimate(train_passengers)
221 | cm_train = confusion_matrix(
222 | [v.is_survived for v in train_passengers], y_train_estimation
223 | )
224 |
225 | # --- TESTING ---
226 | y_test_estimation = self.model.estimate(test_passengers)
227 | cm_test = confusion_matrix(
228 | [v.is_survived for v in test_passengers], y_test_estimation
229 | )
230 |
231 | self.model_saver.save_model(
232 | model=self.model,
233 | result={
234 | 'cm_train': cm_train,
235 | 'cm_test': cm_test,
236 | 'train_passengers': train_passengers,
237 | 'test_passengers': test_passengers,
238 | },
239 | )
240 |
241 |
242 | def main(param: str = 'pass'):
243 | titanic_model_creator = TitanicModelCreator(
244 | loader=PassengerLoader(
245 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
246 | rare_titles=RARE_TITLES,
247 | ),
248 | model=TitanicModel(),
249 | model_saver=ModelSaver(
250 | model_filename='../data/real_model.pkl',
251 | result_filename='../data/real_result.pkl',
252 | ),
253 | )
254 | titanic_model_creator.run()
255 |
256 |
257 | def test_main(param: str = 'pass'):
258 | titanic_model_creator = TitanicModelCreator(
259 | loader=PassengerLoader(
260 | loader=TestLoader(
261 | passengers_filename='../data/passengers_with_is_survived.pkl',
262 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
263 | ),
264 | rare_titles=RARE_TITLES,
265 | ),
266 | model=TitanicModel(n_neighbors=5, predictor=LogisticRegression(random_state=0)),
267 | model_saver=TestModelSaver(),
268 | )
269 | titanic_model_creator.run()
270 |
271 |
272 | if __name__ == "__main__":
273 | typer.run(test_main)
274 |
--------------------------------------------------------------------------------
/Step22/titanic_model.sh:
--------------------------------------------------------------------------------
1 | python3.8 ./titanic_model.py $1
2 |
--------------------------------------------------------------------------------
/create_branch.sh:
--------------------------------------------------------------------------------
1 |
2 | switch=$1
3 |
4 | if [ -z "$switch" ]; then
5 | echo "use ./create_branch.sh -do"
6 | else
7 | cp Step02/* Step01
8 | cp Step03/* Step02
9 | cp Step04/* Step03
10 | cp Step05/* Step04
11 | cp Step06/* Step05
12 | cp Step07/* Step06
13 | cp Step08/* Step07
14 | cp Step09/* Step08
15 | cp Step10/* Step09
16 | cp Step11/* Step10
17 | cp Step12/* Step11
18 | cp Step13/* Step12
19 | cp Step14/* Step13
20 | cp Step15/* Step14
21 | cp Step16/* Step15
22 | cp Step17/* Step16
23 | cp Step18/* Step17
24 | cp Step19/* Step18
25 | cp Step20/* Step19
26 | cp Step21/* Step20
27 | cp Step22/* Step21
28 | fi
29 |
--------------------------------------------------------------------------------
/create_instructions.sh:
--------------------------------------------------------------------------------
1 | find . -name "README??.md" | sort | xargs cat > INSTRUCTIONS.md
2 |
--------------------------------------------------------------------------------
/create_sqlite.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import typer
4 | from sklearn.datasets import fetch_openml
5 | from sqlalchemy import create_engine
6 |
7 |
8 | def main():
9 | print('loading data')
10 | df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
11 | targets = pd.DataFrame(np.array([int(v) for v in targets]), columns=['is_survived'])
12 |
13 | print('creating db')
14 | engine = create_engine('sqlite:///titanic.db', echo=True)
15 | sqlite_connection = engine.connect()
16 |
17 | print('saving passengers')
18 | df.to_sql('tbl_passengers', sqlite_connection, index_label='pid')
19 | print('saving targets')
20 | targets.to_sql('tbl_targets', sqlite_connection, index_label='pid')
21 |
22 | print('closing db')
23 | sqlite_connection.close()
24 |
25 | print('done')
26 |
27 |
28 | if __name__ == "__main__":
29 | typer.run(main)
30 |
--------------------------------------------------------------------------------
/data/titanic.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/data/titanic.db
--------------------------------------------------------------------------------
/make_venv.sh:
--------------------------------------------------------------------------------
1 | python3.8 -m venv .venv
2 | source .venv/bin/activate
3 | pip3 install --upgrade pip
4 | pip3 install setuptools==57.1.0
5 | pip3 install wheel
6 | pip3 install -r requirements.txt
7 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | typer
2 | scikit-learn
3 | jupyter
4 | pandas
5 | numpy
6 | sqlalchemy
7 | pydantic
8 | black
9 |
--------------------------------------------------------------------------------
/test_all.sh:
--------------------------------------------------------------------------------
1 | rm data/*.pkl
2 | echo "Step05" && cd Step05 && ./titanic_model.sh && cd ..
3 | echo "Step06" && cd Step06 && ./titanic_model.sh && cd ..
4 | echo "Step07" && cd Step07 && ./titanic_model.sh && cd ..
5 | echo "Step08" && cd Step08 && ./titanic_model.sh && cd ..
6 | echo "Step09" && cd Step09 && ./titanic_model.sh && cd ..
7 | echo "Step10" && cd Step10 && ./titanic_model.sh && cd ..
8 | echo "Step11" && cd Step11 && ./titanic_model.sh && cd ..
9 | echo "Step12" && cd Step12 && ./titanic_model.sh && cd ..
10 | echo "Step13" && cd Step13 && ./titanic_model.sh && cd ..
11 | echo "Step14" && cd Step14 && ./titanic_model.sh && cd ..
12 | echo "Step15" && cd Step15 && ./titanic_model.sh && cd ..
13 | echo "Step16" && cd Step16 && ./titanic_model.sh && cd ..
14 | echo "Step17" && cd Step17 && ./titanic_model.sh && cd ..
15 | echo "Step18" && cd Step18 && ./titanic_model.sh && cd ..
16 | echo "Step19" && cd Step19 && ./titanic_model.sh && cd ..
17 | echo "Step20" && cd Step20 && ./titanic_model.sh && cd ..
18 | echo "Step21" && cd Step21 && ./titanic_model.sh && cd ..
19 | echo "Step22" && cd Step22 && ./titanic_model.sh && cd ..
20 |
--------------------------------------------------------------------------------