├── .gitignore
├── INSTRUCTIONS.md
├── README.md
├── Step00
│   ├── README.md
│   └── Slide0_Notebook.ipynb
├── Step01
│   ├── .gitkeep
│   └── README00.md
├── Step02
│   ├── README01.md
│   ├── make_venv.sh
│   └── requirements.txt
├── Step03
│   ├── README02.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step04
│   ├── README03.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step05
│   ├── README04.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step06
│   ├── README05.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step07
│   ├── README06.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step08
│   ├── README07.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step09
│   ├── README08.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step10
│   ├── README09.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step11
│   ├── README10.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step12
│   ├── README11.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step13
│   ├── README12.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step14
│   ├── README13.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step15
│   ├── README14.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step16
│   ├── README15.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step17
│   ├── README16.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step18
│   ├── README17.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step19
│   ├── README18.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step20
│   ├── README19.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step21
│   ├── README20.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── Step22
│   ├── README21.md
│   ├── make_venv.sh
│   ├── requirements.txt
│   ├── titanic_model.py
│   └── titanic_model.sh
├── create_branch.sh
├── create_instructions.sh
├── create_sqlite.py
├── data
│   └── titanic.db
├── make_venv.sh
├── requirements.txt
└── test_all.sh

/.gitignore:
--------------------------------------------------------------------------------
1 | .venv/
2 | .vscode/
3 | .idea/
4 | .ipynb_checkpoints/
5 | __pycache__
6 | __pycache__/*
7 | data/*.pkl
--------------------------------------------------------------------------------
/INSTRUCTIONS.md:
--------------------------------------------------------------------------------
1 | ### Step 01: Project setup
2 | 
3 | - Write script to create virtual environment
4 | - Write the first `requirements.txt`
5 | 
6 | You can select a different `setuptools` version or pin the package versions.
7 | ### Step 02: Code setup
8 | 
9 | - Write a Python script stub with `typer`
10 | - Write a shell script to execute the Python script
11 | 
12 | `Typer` is an amazing tool that turns any Python script into a command-line application. Here we use it for future-proofing because at the moment there are no CLI arguments.
13 | 
14 | The program will be defined in a class that is instantiated by the `main()` function, which then calls its `run()` entry point.
The `main()` function will be called by `typer` to pass any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`). More on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london).
15 | ### Step 03: Move code out of the notebook
16 | 
17 | - Copy-paste everything into the `run()` function
18 | 
19 | The first step is just to get started. There will be plenty of steps to structure the code better.
20 | ### Step 04: Move over the tests
21 | 
22 | - Copy-paste the tests and testing code from the notebook in `Step00` into the `run()` function.
23 | 
24 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the values of some variables, and the next time you run the code it compares them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
25 | ### Step 05: Decouple from the database
26 | 
27 | - Write the `SqlLoader` class
28 | - Move database-related code into it
29 | - Replace database calls with interface calls in `run()`
30 | 
31 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that, you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
32 | ### Step 06: Decouple from the database
33 | 
34 | - Create a loader property and argument in `TitanicModelCreator.__init__()`
35 | - Remove the database loader instantiation from the `run()` function
36 | - Update the `TitanicModelCreator` construction to create the loader there
37 | 
38 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing to build a test context for rapid iteration. After you have created the adapter class, this step does the actual decoupling. This is an example of "Dependency Injection", where a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction.
As the saying goes: "Complex behaviour is constructed, not written." Dependency Injection is closely related to the Dependency Inversion Principle, the `D` in the famed `SOLID` principles, and arguably the most important of them.
39 | ### Step 07: Write testing dataloader
40 | 
41 | - Write a class that loads the required data from files
42 | - Same interface as `SqlLoader`
43 | - Add a "real" loader to it as a property
44 | 
45 | This will allow the test context to work without a DB connection and still have the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable as they have the same interface.
46 | ### Step 08: Write the test context
47 | 
48 | - Create a `test_main` function
49 | - Make sure `typer` calls that in `typer.run()`
50 | - Copy the code from `main` to `test_main`
51 | - Replace `SqlLoader` in it with `TestLoader`
52 | 
53 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. Also, if it is still not fast enough, an additional parameter can reduce the amount of data in the test to make the process faster. [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo]
54 | 
55 | This is the essence of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places, there is no time spent on translating from one to the other. The test setup should reflect the production context as closely as possible, so that when a test fails or passes you can expect the same to happen in production as well. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" when you are convinced it is doing what you think it should do. But it is the same code, so deployment is effortless.
56 | ### Step 09: Merge passenger data with targets
57 | 
58 | - Remove the `get_targets()` interface
59 | - Replace the query in `SqlLoader`
60 | - Remove any code related to `targets`
61 | 
62 | This step prepares for building the "domain data model". The Titanic model is about the survival of her passengers. For the code to align with this domain, the concept of "passengers" needs to be introduced (as a class/object). A passenger either survived or not; survival is an attribute of the passenger and needs to be implemented as such.
63 | 
64 | This is a critical part of the code quality journey and of building better systems. Once you introduce these concepts, your code will depend directly on the business problem you are solving, not on the various representations in which the data is stored (pandas, numpy, csv, etc.).
I wrote about this many times on my blog:
65 | 
66 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data)
67 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
68 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses)
69 | ### Step 10: Create Passenger class
70 | 
71 | - Import `BaseModel` from `pydantic`
72 | - Create the class by inspecting:
73 |     - The `dtype` of columns used in `df`
74 |     - The actual values in `df`
75 |     - The names of the columns that are used later in the code
76 | 
77 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you need to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic`'s validators. One goal of the refactoring is to make sure that most interactions between classes happen through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen.
78 | ### Step 11: Create domain data object based data loader
79 | 
80 | - Create a `PassengerLoader` class that takes a "real"/"old" loader
81 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects
82 | - Copy the data transformations from `TitanicModelCreator.run()`
83 | 
84 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear fewer than 10 times are selected. This can only be done if you have access to the entire database, and the resulting list needs to be maintained. This can cause problems in a real setting when the above operation is too difficult to do, for example if you have millions of items or a constant stream. These kinds of dependencies are common in legacy code, and one of the goals of refactoring is to identify them and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service.
85 | 
86 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other objects; they are a type of adapter that hides away where the data is coming from and how it is stored, returning only abstract, domain-relevant objects that you can use downstream. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows:
87 | 
88 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to)
89 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london)
90 | ### Step 12: Remove any data that is not explicitly needed
91 | 
92 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input
93 | 
94 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed. If the requirements change, they can be added back again. For example, the `ticket` column is in `df`, but it is never used again in the program. Remove it.
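Before moving on, here is a minimal sketch of how the pieces from Steps 10 and 11 could fit together. It is a sketch under assumptions, not the reference solution: the field names and types are inferred from the columns used in the notebook (Step 10 tells you to verify them against the real data), and `RARE_TITLES` is copied from the `rare_titles` output at the bottom of `Step00/Slide0_Notebook.ipynb`.

```python
from typing import List, Optional
from pydantic import BaseModel

# Titles that appear fewer than 10 times in the full dataset
# (copied from the notebook's `rare_titles` output).
RARE_TITLES = {
    'Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
    'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir', 'the Countess',
}


class Passenger(BaseModel):
    pid: int
    pclass: int
    sex: str
    age: Optional[float] = None  # missing ages are imputed later by the KNNImputer
    fare: Optional[float] = None
    family_size: int
    is_alone: int
    title: str
    embarked: str
    is_survived: int


class PassengerLoader:
    """Factory: turns raw rows from the 'real' loader into domain objects."""

    def __init__(self, loader, rare_titles):
        self.loader = loader  # e.g. SqlLoader; after Step 09 its query includes is_survived
        self.rare_titles = rare_titles

    def get_passengers(self) -> List[Passenger]:
        passengers = []
        for _, row in self.loader.get_passengers().iterrows():
            # the same transformations as in the notebook's feature engineering cell
            family_size = row['parch'] + row['sibsp']
            title = row['name'].split(',')[1].split('.')[0].strip()
            passengers.append(
                Passenger(
                    pid=row['pid'],
                    pclass=row['pclass'],
                    sex=row['sex'],
                    age=row['age'],
                    fare=row['fare'],
                    family_size=family_size,
                    is_alone=1 if family_size == 1 else 0,
                    title='rare' if title in self.rare_titles else title,
                    embarked=row['embarked'],
                    is_survived=int(row['is_survived']),
                )
            )
        return passengers
```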
95 | ### Step 13: Use Passenger objects in the program
96 | 
97 | - Add `PassengerLoader` to `main` and `test_main`
98 | - Add the `RARE_TITLES` constant
99 | - Convert the classes back into the `df` dataframe with `passenger.dict()`
100 | 
101 | It is very important to do refactoring incrementally. Any change should be small enough that if the tests fail, the source can be found quickly. So for now we stop at using the new loader and do not change anything else.
102 | ### Step 14: Separate training and evaluation functions
103 | 
104 | - Move all code related to evaluation (variables that have `_test_` in their names) into one group
105 | 
106 | After the model is created, first it is trained, then it is evaluated on the training data, then on the testing data. These steps should be separated from each other into their own logical places. This prepares for moving them into actually separate places.
107 | ### Step 15: Create `TitanicModel` class
108 | 
109 | - Create a class that has all the `sklearn` components as member variables
110 | - Instantiate these before the "Training" block
111 | - Use these instead of the local ones
112 | 
113 | The goal of the whole program is to create a model; despite this, until now there was no single object describing this model. The next step is to establish the concept of this model and what kind of services it provides for `TitanicModelCreator`.
114 | ### Step 16: Passenger class based training and evaluation sets
115 | 
116 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely if the passenger survived or not)
117 | - Refactor `X_train/X_test` to be created from these lists of passengers
118 | 
119 | Because `train_test_split` works on lists, we extract the pids and the targets from the objects and create the two sets through a mapping from pids to `Passenger` objects.
120 | ### Step 17: Create input processing for `TitanicModel`
121 | 
122 | - Move the code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`.
123 | - Introduce a `self.trained` boolean
124 | - Based on `self.trained`, call either the `transform` or the `fit_transform` of the `sklearn` input processor functions
125 | 
126 | All the input transformation code happens twice: once for the training data and once for the evaluation data, even though transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies the functionality from `TitanicModel`. There will be several steps to resolve this. The resulting code will create a self-contained model that can be shipped independently from its creator.
127 | 
128 | ### Step 18: Move training into `TitanicModel`
129 | 
130 | - Use the same interface as `process_inputs` with `train()`
131 | - Process the data with `process_inputs` (just pass through the arguments)
132 | - Recreate the required targets with the mapping
133 | - Train the model and set the `trained` boolean to `True`
134 | ### Step 19: Move prediction to `TitanicModel`
135 | 
136 | - Create the `estimate` function
137 | - Call `process_inputs` and `predictor.predict` in it
138 | - Remove all evaluation input processing code
139 | - Call `estimate` from `run`
140 | 
141 | Because there was no separation of concerns, the input processing code was duplicated; now that we have moved it to its own location, the duplicate can be removed.
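For illustration only, here is a minimal sketch of where Steps 15 to 19 are heading. It is written under assumptions: it takes the dataframe from Step 13 rather than lists of `Passenger` objects, and the exact signatures are up to you; your cached tests decide whether your version behaves the same.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, RobustScaler


class TitanicModel:
    def __init__(self):
        # the sklearn components from Step 15, now owned by the model
        self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        self.knn_imputer = KNNImputer(n_neighbors=5)
        self.robust_scaler = RobustScaler()
        self.predictor = LogisticRegression(random_state=0)
        self.trained = False

    def process_inputs(self, df):
        categorical = df[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
        numerical = df[['age', 'fare', 'family_size']]
        if self.trained:
            # evaluation/inference: only apply the already fitted transformations
            categorical_one_hot = self.one_hot_encoder.transform(categorical)
            numerical_scaled = self.robust_scaler.transform(
                self.knn_imputer.transform(numerical)
            )
        else:
            # training: fit the transformations on the training data first
            categorical_one_hot = self.one_hot_encoder.fit_transform(categorical)
            numerical_scaled = self.robust_scaler.fit_transform(
                self.knn_imputer.fit_transform(numerical)
            )
        return np.hstack((categorical_one_hot, numerical_scaled))

    def train(self, df, targets):
        self.predictor.fit(self.process_inputs(df), targets)
        self.trained = True

    def estimate(self, df):
        return self.predictor.predict(self.process_inputs(df))
```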
142 | 
143 | `X_train_processed` and `X_test_processed` do not exist any more, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and to find a different way to test behaviour. To keep the project short we set this aside, but this would be a good place to introduce more tests.
144 | ### Step 20: Save model and move tests to custom model savers
145 | 
146 | - Create `ModelSaver` that has a `save_model` interface that accepts a model and a result object
147 | - Pickle the model and the result to a file
148 | - Create `TestModelSaver` that has the same interface
149 | - Move the testing code to the `save_model` function
150 | - Add a `model_saver` property to `TitanicModelCreator` and call it after the evaluation code
151 | - Add an instance of `ModelSaver` and `TestModelSaver` respectively in `main` and `test_main` to the construction of `TitanicModelCreator`
152 | 
153 | Currently `TitanicModelCreator` contains its own testing, even though it is intended to run in production. It also has no way to save the model. We will introduce the concept of `ModelSaver` here: anything that needs to be preserved after the model training needs to be passed to this class.
154 | 
155 | We will also move testing into a specific `TestModelSaver` that, instead of saving the model, runs the tests that would otherwise be in `run()`. This way the same code can run in production and in testing without change.
156 | ### Step 21: Enable training of different models
157 | 
158 | - Add a `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance.
159 | - Add the `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main`
160 | - Expose parts of `TitanicModel` (predictor, processing parameters)
161 | 
162 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment that uses new loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows restructuring your code on the fly and finding out what the most convenient architecture for your system is. Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start implementation. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintaining code quality provide exactly this at very low effort.
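After Step 21 the two contexts become pure construction. A sketch of the final wiring, assuming the classes from the steps above are in scope; the `TestLoader` argument names are illustrative, and by this point the targets are merged into the passengers (Step 09), so the test loader only needs one file:

```python
# Sketch only: assumes TitanicModelCreator, PassengerLoader, SqlLoader, TestLoader,
# TitanicModel, ModelSaver, TestModelSaver and RARE_TITLES from the steps above
# are in scope.
def main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            rare_titles=RARE_TITLES,
        ),
        model=TitanicModel(),
        model_saver=ModelSaver(),
    )
    titanic_model_creator.run()


def test_main(param: str = 'pass'):
    titanic_model_creator = TitanicModelCreator(
        loader=PassengerLoader(
            loader=TestLoader(
                passengers_filename='../data/passengers.pkl',  # illustrative name
                real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
            ),
            rare_titles=RARE_TITLES,
        ),
        model=TitanicModel(),
        model_saver=TestModelSaver(),
    )
    titanic_model_creator.run()
```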
163 | 
164 | Next steps:
165 | 
166 | - Use different data:
167 |     - Update `SqlLoader` to retrieve different data
168 |     - Update the `Passenger` class to contain this new data
169 |     - Update the `PassengerLoader` class to process this new data into the classes
170 |     - Update `process_inputs` to create features out of this new data
171 | - Use different features:
172 |     - Update `process_inputs` in `TitanicModel`, expose parameters as needed
173 | - Use different model:
174 |     - Use a different `predictor` in `TitanicModel`
175 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CQ4DS Notebook Sklearn Refactoring Exercise
2 | 
3 | This step-by-step programme demonstrates how to refactor a Data Science project from notebooks to well-formed classes and scripts.
4 | 
5 | ### The project:
6 | 
7 | The notebook demonstrates a typical setup of a data science project:
8 | 
9 | - Connects to a database (included in the repository as an SQLite file).
10 | - Gathers some data (the classic Titanic example).
11 | - Does feature engineering.
12 | - Fits a model to estimate survival (sklearn's LogisticRegression).
13 | - Evaluates the model.
14 | 
15 | ### Context, vision
16 | 
17 | I wrote a detailed post on the concepts, strategy and big-picture thinking. I recommend reading it in parallel with the instructions and the steps in the pull request while you are doing the exercises:
18 | 
19 | [https://laszlo.substack.com/p/refactoring-the-titanic](https://laszlo.substack.com/p/refactoring-the-titanic)
20 | 
21 | ### Refactoring
22 | 
23 | The programme demonstrates how to improve code quality, increase agility and prepare for unforeseen changes in a real-world project (see `INSTRUCTIONS.md` for reference reading). You will perform the following steps:
24 | 
25 | - Create end-to-end functional testing
26 | - Create shell scripts, command line interfaces, virtual environments
27 | - Decouple from external sources (the Database)
28 | - Refactor with simple Design Patterns (Adapter/Factory/Strategy)
29 | - Improve readability
30 | - Reduce code duplication
31 | 
32 | ### Howto:
33 | 
34 | - Clone the repository.
35 | - Create a virtual environment with `make_venv.sh`.
36 | - Follow the instructions in `INSTRUCTIONS.md`.
37 | - Run the tests with `titanic_model.sh`.
38 | - Check the diffs of the pull request's steps to verify your progress.
39 | 
40 | ### Community:
41 | 
42 | For more information and help, join our interactive self-help Code Quality for Data Science (CQ4DS) community on discord: [https://discord.gg/8uUZNMCad2](https://discord.gg/8uUZNMCad2).
43 | 44 | Original project content from and inspired by: [https://jaketae.github.io/study/sklearn-pipeline/](https://jaketae.github.io/study/sklearn-pipeline/) 45 | -------------------------------------------------------------------------------- /Step00/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step00/README.md -------------------------------------------------------------------------------- /Step00/Slide0_Notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import os\n", 10 | "import pickle\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "from collections import Counter\n", 14 | "from sqlalchemy import create_engine\n", 15 | "\n", 16 | "from sklearn.model_selection import train_test_split\n", 17 | "from sklearn.linear_model import LogisticRegression\n", 18 | "from sklearn.preprocessing import RobustScaler\n", 19 | "from sklearn.preprocessing import OneHotEncoder\n", 20 | "from sklearn.impute import KNNImputer\n", 21 | "from sklearn.metrics import confusion_matrix" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 2, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/html": [ 32 | "
\n", 33 | "\n", 46 | "\n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | "
typenametbl_namerootpagesql
0tabletbl_passengerstbl_passengers2CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\...
1tabletbl_targetstbl_targets35CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t...
\n", 76 | "
" 77 | ], 78 | "text/plain": [ 79 | " type name tbl_name rootpage \\\n", 80 | "0 table tbl_passengers tbl_passengers 2 \n", 81 | "1 table tbl_targets tbl_targets 35 \n", 82 | "\n", 83 | " sql \n", 84 | "0 CREATE TABLE tbl_passengers (\\n\\tpid BIGINT, \\... \n", 85 | "1 CREATE TABLE tbl_targets (\\n\\tpid BIGINT, \\n\\t... " 86 | ] 87 | }, 88 | "execution_count": 2, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "engine = create_engine('sqlite:///../data/titanic.db')\n", 95 | "sqlite_connection = engine.connect()\n", 96 | "pd.read_sql('SELECT * FROM sqlite_schema WHERE type=\"table\"', con=sqlite_connection)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "np.random.seed(42)\n", 106 | "\n", 107 | "df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection)\n", 108 | "\n", 109 | "targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection)\n", 110 | "\n", 111 | "# df, targets = fetch_openml(\"titanic\", version=1, as_frame=True, return_X_y=True)\n", 112 | "\n", 113 | "# parch = Parents/Children, sibsp = Siblings/Spouses\n", 114 | "df['family_size'] = df['parch'] + df['sibsp']\n", 115 | "df['is_alone'] = [1 if family_size==1 else 0 for family_size in df['family_size']]\n", 116 | "\n", 117 | "df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']]\n", 118 | "rare_titles = {k for k,v in Counter(df['title']).items() if v < 10}\n", 119 | "df['title'] = ['rare' if title in rare_titles else title for title in df['title']]\n", 120 | "\n", 121 | "df = df[[\n", 122 | " 'pclass', 'sex', 'age', 'ticket', 'family_size',\n", 123 | " 'fare', 'embarked', 'is_alone', 'title'\n", 124 | "]]\n", 125 | "\n", 126 | "targets = [int(v) for v in targets['is_survived']]\n", 127 | "X_train, X_test, y_train, y_test = train_test_split(df, targets, stratify=targets, test_size=0.2)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 4, 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "data": { 137 | "text/html": [ 138 | "
\n", 139 | "\n", 152 | "\n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | "
pclasssexageticketfamily_sizefareembarkedis_alonetitle
01.0female29.0000241600.0211.3375S0Miss
11.0male0.91671137813.0151.5500S0Master
21.0female2.00001137813.0151.5500S0Miss
\n", 206 | "
" 207 | ], 208 | "text/plain": [ 209 | " pclass sex age ticket family_size fare embarked is_alone \\\n", 210 | "0 1.0 female 29.0000 24160 0.0 211.3375 S 0 \n", 211 | "1 1.0 male 0.9167 113781 3.0 151.5500 S 0 \n", 212 | "2 1.0 female 2.0000 113781 3.0 151.5500 S 0 \n", 213 | "\n", 214 | " title \n", 215 | "0 Miss \n", 216 | "1 Master \n", 217 | "2 Miss " 218 | ] 219 | }, 220 | "execution_count": 4, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "df[:3]" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 5, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [ 235 | "X_train_categorical = X_train[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n", 236 | "X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']]\n", 237 | "\n", 238 | "one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit(X_train_categorical)\n", 239 | "X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical)\n", 240 | "X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical)" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 6, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "X_train_numerical = X_train[['age', 'fare', 'family_size']]\n", 250 | "X_test_numerical = X_test[['age', 'fare', 'family_size']]\n", 251 | "knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical)\n", 252 | "X_train_numerical_imputed = knn_imputer.transform(X_train_numerical)\n", 253 | "X_test_numerical_imputed = knn_imputer.transform(X_test_numerical)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 7, 259 | "metadata": {}, 260 | "outputs": [], 261 | "source": [ 262 | "robust_scaler = RobustScaler().fit(X_train_numerical_imputed)\n", 263 | "X_train_numerical_imputed_scaled = robust_scaler.transform(X_train_numerical_imputed)\n", 264 | "X_test_numerical_imputed_scaled = robust_scaler.transform(X_test_numerical_imputed)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "X_train_processed = np.hstack((X_train_categorical_one_hot, X_train_numerical_imputed_scaled))\n", 274 | "X_test_processed = np.hstack((X_test_categorical_one_hot, X_test_numerical_imputed_scaled))" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 9, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "model = LogisticRegression(random_state=0).fit(X_train_processed, y_train)\n", 284 | "y_train_estimation = model.predict(X_train_processed)\n", 285 | "y_test_estimation = model.predict(X_test_processed)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 10, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "cm_train = confusion_matrix(y_train, y_train_estimation)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 11, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "cm_test = confusion_matrix(y_test, y_test_estimation)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 12, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "array([[553, 94],\n", 315 | " [107, 293]])" 316 | ] 317 | }, 318 | "execution_count": 12, 319 | "metadata": {}, 320 | "output_type": "execute_result" 321 | } 322 | ], 323 | "source": [ 324 | "cm_train" 325 | ] 326 | }, 327 | { 
328 | "cell_type": "code", 329 | "execution_count": 13, 330 | "metadata": {}, 331 | "outputs": [ 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "array([[142, 20],\n", 336 | " [ 22, 78]])" 337 | ] 338 | }, 339 | "execution_count": 13, 340 | "metadata": {}, 341 | "output_type": "execute_result" 342 | } 343 | ], 344 | "source": [ 345 | "cm_test" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 14, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "../data/cm_test.pkl test passed\n", 358 | "../data/cm_train.pkl test passed\n", 359 | "../data/X_train_processed.pkl test passed\n", 360 | "../data/X_test_processed.pkl test passed\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "def do_test(filename, data):\n", 366 | " if not os.path.isfile(filename):\n", 367 | " pickle.dump(data, open(filename, 'wb'))\n", 368 | " truth = pickle.load(open(filename, 'rb'))\n", 369 | " try:\n", 370 | " np.testing.assert_almost_equal(data, truth)\n", 371 | " print(f'{filename} test passed')\n", 372 | " except AssertionError as ex:\n", 373 | " print(f'{filename} test failed {ex}')\n", 374 | " \n", 375 | "do_test('../data/cm_test.pkl', cm_test)\n", 376 | "do_test('../data/cm_train.pkl', cm_train)\n", 377 | "do_test('../data/X_train_processed.pkl', X_train_processed)\n", 378 | "do_test('../data/X_test_processed.pkl', X_test_processed)\n" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 15, 384 | "metadata": {}, 385 | "outputs": [ 386 | { 387 | "name": "stdout", 388 | "output_type": "stream", 389 | "text": [ 390 | "../data/df.pkl pandas test passed\n" 391 | ] 392 | } 393 | ], 394 | "source": [ 395 | "def do_pandas_test(filename, data):\n", 396 | " if not os.path.isfile(filename):\n", 397 | " data.to_pickle(filename)\n", 398 | " truth = pd.read_pickle(filename)\n", 399 | " try:\n", 400 | " pd.testing.assert_frame_equal(data, truth)\n", 401 | " print(f'{filename} pandas test passed')\n", 402 | " except AssertionError as ex:\n", 403 | " print(f'{filename} pandas test failed {ex}')\n", 404 | " \n", 405 | "# df['title'] = ['asd' for v in df['title']]\n", 406 | "do_pandas_test('../data/df.pkl', df)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 16, 412 | "metadata": {}, 413 | "outputs": [ 414 | { 415 | "data": { 416 | "text/plain": [ 417 | "{'Capt',\n", 418 | " 'Col',\n", 419 | " 'Don',\n", 420 | " 'Dona',\n", 421 | " 'Dr',\n", 422 | " 'Jonkheer',\n", 423 | " 'Lady',\n", 424 | " 'Major',\n", 425 | " 'Mlle',\n", 426 | " 'Mme',\n", 427 | " 'Ms',\n", 428 | " 'Rev',\n", 429 | " 'Sir',\n", 430 | " 'the Countess'}" 431 | ] 432 | }, 433 | "execution_count": 16, 434 | "metadata": {}, 435 | "output_type": "execute_result" 436 | } 437 | ], 438 | "source": [ 439 | "rare_titles" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [] 448 | } 449 | ], 450 | "metadata": { 451 | "kernelspec": { 452 | "display_name": "Python 3 (ipykernel)", 453 | "language": "python", 454 | "name": "python3" 455 | }, 456 | "language_info": { 457 | "codemirror_mode": { 458 | "name": "ipython", 459 | "version": 3 460 | }, 461 | "file_extension": ".py", 462 | "mimetype": "text/x-python", 463 | "name": "python", 464 | "nbconvert_exporter": "python", 465 | "pygments_lexer": "ipython3", 466 | "version": "3.8.10" 467 | }, 468 | "vscode": { 469 | "interpreter": { 470 | "hash": 
"774712da715a3086605d6bf08e7144a3a7e717b0d5585da12e288357dd4c8f07" 471 | } 472 | } 473 | }, 474 | "nbformat": 4, 475 | "nbformat_minor": 4 476 | } 477 | -------------------------------------------------------------------------------- /Step01/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/.gitkeep -------------------------------------------------------------------------------- /Step01/README00.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/Step01/README00.md -------------------------------------------------------------------------------- /Step02/README01.md: -------------------------------------------------------------------------------- 1 | ### Step 01: Project setup 2 | 3 | - Write script to create virtual environment 4 | - Write the first `requirements.txt` 5 | 6 | You can select a different `setuptools` version or pin the package versions. 7 | -------------------------------------------------------------------------------- /Step02/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step02/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step03/README02.md: -------------------------------------------------------------------------------- 1 | ### Step 02: Code setup 2 | 3 | - Write python script stub with `typer` 4 | - Write shell script to execute the python script 5 | 6 | `Typer` is an amazing tool that turns any python script into shell scripts. Here we use it for future-proofing because at the moment there are no CLI arguments. 7 | 8 | The program will be defined in a class that is instantiated by the `main()` function and call its main `run()` entry point. The `main()` function will be called by `typer` to pass any CLI parameters. This setup will allow us to create a "plugin" architecture and construct different behaviour (e.g.: normal, test, production) in different main functions. This is a form of "Clean Architecture" where the code (the class) is independent of the infrastructure that calls it (`main()`) more on this: [Clean Architecture: How to structure your ML projects to reduce technical debt (PyData London 2022)](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london). 
9 | -------------------------------------------------------------------------------- /Step03/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step03/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step03/titanic_model.py: -------------------------------------------------------------------------------- 1 | import typer 2 | 3 | 4 | class TitanicModelCreator: 5 | def __init__(self): 6 | pass 7 | 8 | def run(self): 9 | print('Hello World!') 10 | 11 | 12 | def main(param: str = 'pass'): 13 | titanic_model_creator = TitanicModelCreator() 14 | titanic_model_creator.run() 15 | 16 | 17 | if __name__ == "__main__": 18 | typer.run(main) 19 | -------------------------------------------------------------------------------- /Step03/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step04/README03.md: -------------------------------------------------------------------------------- 1 | ### Step 03: Move code out of the notebook 2 | 3 | - Copy-paste everything into the `run()` function 4 | 5 | The first step is just to get started. There will be plenty of steps to structure the code better.
6 | -------------------------------------------------------------------------------- /Step04/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step04/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step04/titanic_model.py: -------------------------------------------------------------------------------- 1 | import typer 2 | import numpy as np 3 | import pandas as pd 4 | from collections import Counter 5 | from sqlalchemy import create_engine 6 | 7 | from sklearn.model_selection import train_test_split 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.preprocessing import RobustScaler 10 | from sklearn.preprocessing import OneHotEncoder 11 | from sklearn.impute import KNNImputer 12 | from sklearn.metrics import confusion_matrix 13 | 14 | 15 | class TitanicModelCreator: 16 | def __init__(self): 17 | pass 18 | 19 | def run(self): 20 | engine = create_engine('sqlite:///../data/titanic.db') 21 | sqlite_connection = engine.connect() 22 | pd.read_sql( 23 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection 24 | ) 25 | np.random.seed(42) 26 | 27 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection) 28 | 29 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection) 30 | 31 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) 32 | 33 | # parch = Parents/Children, sibsp = Siblings/Spouses 34 | df['family_size'] = df['parch'] + df['sibsp'] 35 | df['is_alone'] = [ 36 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 37 | ] 38 | 39 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 40 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 41 | df['title'] = [ 42 | 'rare' if title in rare_titles else title for title in df['title'] 43 | ] 44 | 45 | df = df[ 46 | [ 47 | 'pclass', 48 | 'sex', 49 | 'age', 50 | 'ticket', 51 | 'family_size', 52 | 'fare', 53 | 'embarked', 54 | 'is_alone', 55 | 'title', 56 | ] 57 | ] 58 | 59 | targets = [int(v) for v in targets['is_survived']] 60 | X_train, X_test, y_train, y_test = train_test_split( 61 | df, targets, stratify=targets, test_size=0.2 62 | ) 63 | 64 | X_train_categorical = X_train[ 65 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 66 | ] 67 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 68 | 69 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 70 | X_train_categorical 71 | ) 72 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 73 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 74 | 75 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 76 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 77 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 78 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 79 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 80 | 81 | 
robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 82 | X_train_numerical_imputed_scaled = robust_scaler.transform( 83 | X_train_numerical_imputed 84 | ) 85 | X_test_numerical_imputed_scaled = robust_scaler.transform( 86 | X_test_numerical_imputed 87 | ) 88 | 89 | X_train_processed = np.hstack( 90 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 91 | ) 92 | X_test_processed = np.hstack( 93 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 94 | ) 95 | 96 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 97 | y_train_estimation = model.predict(X_train_processed) 98 | y_test_estimation = model.predict(X_test_processed) 99 | 100 | cm_train = confusion_matrix(y_train, y_train_estimation) 101 | 102 | cm_test = confusion_matrix(y_test, y_test_estimation) 103 | 104 | print('cm_train', cm_train) 105 | print('cm_test', cm_test) 106 | 107 | 108 | def main(param: str = 'pass'): 109 | titanic_model_creator = TitanicModelCreator() 110 | titanic_model_creator.run() 111 | 112 | 113 | if __name__ == "__main__": 114 | typer.run(main) 115 | -------------------------------------------------------------------------------- /Step04/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step05/README04.md: -------------------------------------------------------------------------------- 1 | ### Step 04: Move over the tests 2 | 3 | - Copy-paste the tests and testing code from the notebook in `Step00` into the `run()` function. 4 | 5 | This will implement very simple end-to-end testing, which is less effort than unit testing given that the code is not really in a testable state. It caches the values of some variables, and the next time you run the code it compares them to this cache. If they match, you didn't change the behaviour of the code with your last change. If your intention was indeed to change the behaviour, verify from the output of the `AssertionError` that the changes are working as intended. If they are, delete the caches and rerun the code to generate new reference values. The tests should be such that if they fail they produce meaningful differences. So instead of aggregate statistics (like an F1 score), test the datasets themselves. That way even small changes won't go undetected. Once the code is refactored you can write different types of tests, but that's a different story.
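For reference, this is the caching helper from the notebook that implements the mechanism for numpy data; `do_pandas_test` does the same for dataframes with `pd.testing.assert_frame_equal`:

```python
import os
import pickle
import numpy as np


def do_test(filename, data):
    if not os.path.isfile(filename):
        # first run: cache the current value as the reference "truth"
        pickle.dump(data, open(filename, 'wb'))
    truth = pickle.load(open(filename, 'rb'))
    try:
        np.testing.assert_almost_equal(data, truth)
        print(f'{filename} test passed')
    except AssertionError as ex:
        print(f'{filename} test failed {ex}')
```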
6 | -------------------------------------------------------------------------------- /Step05/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step05/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step05/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class TitanicModelCreator: 40 | def __init__(self): 41 | pass 42 | 43 | def run(self): 44 | engine = create_engine('sqlite:///../data/titanic.db') 45 | sqlite_connection = engine.connect() 46 | pd.read_sql( 47 | 'SELECT * FROM sqlite_schema WHERE type="table"', con=sqlite_connection 48 | ) 49 | np.random.seed(42) 50 | 51 | df = pd.read_sql('SELECT * FROM tbl_passengers', con=sqlite_connection) 52 | 53 | targets = pd.read_sql('SELECT * FROM tbl_targets', con=sqlite_connection) 54 | 55 | # df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) 56 | 57 | # parch = Parents/Children, sibsp = Siblings/Spouses 58 | df['family_size'] = df['parch'] + df['sibsp'] 59 | df['is_alone'] = [ 60 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 61 | ] 62 | 63 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 64 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 65 | df['title'] = [ 66 | 'rare' if title in rare_titles else title for title in df['title'] 67 | ] 68 | 69 | df = df[ 70 | [ 71 | 'pclass', 72 | 'sex', 73 | 'age', 74 | 'ticket', 75 | 'family_size', 76 | 'fare', 77 | 'embarked', 78 | 'is_alone', 79 | 'title', 80 | ] 81 | ] 82 | 83 | targets = [int(v) for v in targets['is_survived']] 84 | X_train, X_test, y_train, y_test = train_test_split( 85 | df, targets, stratify=targets, test_size=0.2 86 | ) 87 | 88 | X_train_categorical = X_train[ 89 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 90 | ] 91 | 
X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 92 | 93 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 94 | X_train_categorical 95 | ) 96 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 97 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 98 | 99 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 100 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 101 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 102 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 103 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 104 | 105 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 106 | X_train_numerical_imputed_scaled = robust_scaler.transform( 107 | X_train_numerical_imputed 108 | ) 109 | X_test_numerical_imputed_scaled = robust_scaler.transform( 110 | X_test_numerical_imputed 111 | ) 112 | 113 | X_train_processed = np.hstack( 114 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 115 | ) 116 | X_test_processed = np.hstack( 117 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 118 | ) 119 | 120 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 121 | y_train_estimation = model.predict(X_train_processed) 122 | y_test_estimation = model.predict(X_test_processed) 123 | 124 | cm_train = confusion_matrix(y_train, y_train_estimation) 125 | 126 | cm_test = confusion_matrix(y_test, y_test_estimation) 127 | 128 | print('cm_train', cm_train) 129 | print('cm_test', cm_test) 130 | 131 | do_test('../data/cm_test.pkl', cm_test) 132 | do_test('../data/cm_train.pkl', cm_train) 133 | do_test('../data/X_train_processed.pkl', X_train_processed) 134 | do_test('../data/X_test_processed.pkl', X_test_processed) 135 | 136 | do_pandas_test('../data/df.pkl', df) 137 | 138 | 139 | def main(param: str = 'pass'): 140 | titanic_model_creator = TitanicModelCreator() 141 | titanic_model_creator.run() 142 | 143 | 144 | if __name__ == "__main__": 145 | typer.run(main) 146 | -------------------------------------------------------------------------------- /Step05/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step06/README05.md: -------------------------------------------------------------------------------- 1 | ### Step 05: Decouple from the database 2 | 3 | - Write the `SqlLoader` class 4 | - Move database-related code into it 5 | - Replace database calls with interface calls in `run()` 6 | 7 | This is a typical example of the Adapter Pattern. Instead of directly calling the DB, we access it through an intermediary, preparing to establish "Loose Coupling" and "Dependency Inversion". In Clean Architecture the main code (the `run()` function) shouldn't know where the data is coming from, just what the data is. This will bring flexibility because this adapter can be replaced with another one that has the same interface but gets the data from a file. After that, you can run your main code without a database, which makes it more testable. More on this: [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to).
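As an illustration of why the shared interface matters (a sketch only, not part of this step, and the file names are hypothetical), a drop-in replacement could read the same dataframes from pickle files instead of the database:

```python
import pandas as pd


class FileLoader:
    """Same interface as SqlLoader, but no database required."""

    def __init__(self, passengers_filename, targets_filename):
        self.passengers_filename = passengers_filename
        self.targets_filename = targets_filename

    def get_passengers(self):
        return pd.read_pickle(self.passengers_filename)

    def get_targets(self):
        return pd.read_pickle(self.targets_filename)
```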
8 | -------------------------------------------------------------------------------- /Step06/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step06/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step06/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class SqlLoader: 40 | def __init__(self, connection_string): 41 | engine = create_engine(connection_string) 42 | self.connection = engine.connect() 43 | 44 | def get_passengers(self): 45 | query = 'SELECT * FROM tbl_passengers' 46 | return pd.read_sql(query, con=self.connection) 47 | 48 | def get_targets(self): 49 | query = 'SELECT * FROM tbl_targets' 50 | return pd.read_sql(query, con=self.connection) 51 | 52 | 53 | class TitanicModelCreator: 54 | def __init__(self): 55 | np.random.seed(42) 56 | 57 | def run(self): 58 | loader = SqlLoader(connection_string='sqlite:///../data/titanic.db') 59 | 60 | df = loader.get_passengers() 61 | targets = loader.get_targets() 62 | 63 | # parch = Parents/Children, sibsp = Siblings/Spouses 64 | df['family_size'] = df['parch'] + df['sibsp'] 65 | df['is_alone'] = [ 66 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 67 | ] 68 | 69 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 70 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 71 | df['title'] = [ 72 | 'rare' if title in rare_titles else title for title in df['title'] 73 | ] 74 | 75 | df = df[ 76 | [ 77 | 'pclass', 78 | 'sex', 79 | 'age', 80 | 'ticket', 81 | 'family_size', 82 | 'fare', 83 | 'embarked', 84 | 'is_alone', 85 | 'title', 86 | ] 87 | ] 88 | 89 | targets = [int(v) for v in targets['is_survived']] 90 | X_train, X_test, y_train, y_test = train_test_split( 91 | df, targets, stratify=targets, test_size=0.2 92 | ) 93 | 94 | 
X_train_categorical = X_train[ 95 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 96 | ] 97 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 98 | 99 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 100 | X_train_categorical 101 | ) 102 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 103 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 104 | 105 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 106 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 107 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 108 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 109 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 110 | 111 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 112 | X_train_numerical_imputed_scaled = robust_scaler.transform( 113 | X_train_numerical_imputed 114 | ) 115 | X_test_numerical_imputed_scaled = robust_scaler.transform( 116 | X_test_numerical_imputed 117 | ) 118 | 119 | X_train_processed = np.hstack( 120 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 121 | ) 122 | X_test_processed = np.hstack( 123 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 124 | ) 125 | 126 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 127 | y_train_estimation = model.predict(X_train_processed) 128 | y_test_estimation = model.predict(X_test_processed) 129 | 130 | cm_train = confusion_matrix(y_train, y_train_estimation) 131 | 132 | cm_test = confusion_matrix(y_test, y_test_estimation) 133 | 134 | print('cm_train', cm_train) 135 | print('cm_test', cm_test) 136 | 137 | do_test('../data/cm_test.pkl', cm_test) 138 | do_test('../data/cm_train.pkl', cm_train) 139 | do_test('../data/X_train_processed.pkl', X_train_processed) 140 | do_test('../data/X_test_processed.pkl', X_test_processed) 141 | 142 | do_pandas_test('../data/df.pkl', df) 143 | 144 | 145 | def main(param: str = 'pass'): 146 | titanic_model_creator = TitanicModelCreator() 147 | titanic_model_creator.run() 148 | 149 | 150 | if __name__ == "__main__": 151 | typer.run(main) 152 | -------------------------------------------------------------------------------- /Step06/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step07/README06.md: -------------------------------------------------------------------------------- 1 | ### Step 06: Decouple from the database 2 | 3 | - Create a loader property and argument in `TitanicModelCreator.__init__()` 4 | - Remove the database loader instantiation from the `run()` function 5 | - Update the `TitanicModelCreator` construction to create the loader there 6 | 7 | This will enable `TitanicModelCreator` to load data from any source, for example files, preparing to build a test context for rapid iteration. After you have created the adapter class, this step does the actual decoupling. This is an example of "Dependency Injection", where a property of your main code is not written into the main body of the code but instead "plugged in" at construction time. The benefit of Dependency Injection is that you can change the behaviour of your code without rewriting it, purely by changing its construction. As the saying goes: "Complex behaviour is constructed, not written."
The `D` in the famed `SOLID` principles is the Dependency Inversion Principle, arguably the most important of the five; Dependency Injection is the standard technique for achieving it. 8 | -------------------------------------------------------------------------------- /Step07/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step07/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step07/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class SqlLoader: 40 | def __init__(self, connection_string): 41 | engine = create_engine(connection_string) 42 | self.connection = engine.connect() 43 | 44 | def get_passengers(self): 45 | query = 'SELECT * FROM tbl_passengers' 46 | return pd.read_sql(query, con=self.connection) 47 | 48 | def get_targets(self): 49 | query = 'SELECT * FROM tbl_targets' 50 | return pd.read_sql(query, con=self.connection) 51 | 52 | 53 | class TitanicModelCreator: 54 | def __init__(self, loader): 55 | self.loader = loader 56 | np.random.seed(42) 57 | 58 | def run(self): 59 | df = self.loader.get_passengers() 60 | targets = self.loader.get_targets() 61 | 62 | # parch = Parents/Children, sibsp = Siblings/Spouses 63 | df['family_size'] = df['parch'] + df['sibsp'] 64 | df['is_alone'] = [ 65 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 66 | ] 67 | 68 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 69 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 70 | df['title'] = [ 71 | 'rare' if title in rare_titles else title for title in df['title'] 72 | ] 73 | 74 | df = df[ 75 | [ 76 | 'pclass', 77 | 'sex', 78 | 'age', 79 | 'ticket', 80 | 'family_size', 81 | 'fare', 82 | 'embarked', 83 | 'is_alone', 84 | 'title', 85 | ] 86 | ] 87 | 88 | targets = [int(v) for v in targets['is_survived']] 89 | X_train, X_test, y_train, y_test =
train_test_split( 90 | df, targets, stratify=targets, test_size=0.2 91 | ) 92 | 93 | X_train_categorical = X_train[ 94 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 95 | ] 96 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 97 | 98 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 99 | X_train_categorical 100 | ) 101 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 102 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 103 | 104 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 105 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 106 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 107 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 108 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 109 | 110 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 111 | X_train_numerical_imputed_scaled = robust_scaler.transform( 112 | X_train_numerical_imputed 113 | ) 114 | X_test_numerical_imputed_scaled = robust_scaler.transform( 115 | X_test_numerical_imputed 116 | ) 117 | 118 | X_train_processed = np.hstack( 119 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 120 | ) 121 | X_test_processed = np.hstack( 122 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 123 | ) 124 | 125 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 126 | y_train_estimation = model.predict(X_train_processed) 127 | y_test_estimation = model.predict(X_test_processed) 128 | 129 | cm_train = confusion_matrix(y_train, y_train_estimation) 130 | 131 | cm_test = confusion_matrix(y_test, y_test_estimation) 132 | 133 | print('cm_train', cm_train) 134 | print('cm_test', cm_test) 135 | 136 | do_test('../data/cm_test.pkl', cm_test) 137 | do_test('../data/cm_train.pkl', cm_train) 138 | do_test('../data/X_train_processed.pkl', X_train_processed) 139 | do_test('../data/X_test_processed.pkl', X_test_processed) 140 | 141 | do_pandas_test('../data/df.pkl', df) 142 | 143 | 144 | def main(param: str = 'pass'): 145 | titanic_model_creator = TitanicModelCreator( 146 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 147 | ) 148 | titanic_model_creator.run() 149 | 150 | 151 | if __name__ == "__main__": 152 | typer.run(main) 153 | -------------------------------------------------------------------------------- /Step07/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step08/README07.md: -------------------------------------------------------------------------------- 1 | ### Step 07: Write testing dataloader 2 | 3 | - Write a class that loads the required data from files 4 | - Same interface as `SqlLoader` 5 | - Add a "real" loader to it as a property 6 | 7 | This will allow the test context to work without DB connection and still have the DB as a fallback when you run it for the first time. For `TitanicModelCreator` the two loaders are indistinguishable as they have the same interface. 
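In use the fallback looks like this (a sketch; the file names are the ones `test_main` adopts in the next step):

```python
# First run: the pickle caches don't exist yet, so TestLoader calls the real
# SqlLoader once and writes the files. Every later run reads the pickles and
# never opens a database connection.
loader = TestLoader(
    passengers_filename='../data/passengers.pkl',
    targets_filename='../data/targets.pkl',
    real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'),
)
passengers = loader.get_passengers()  # a pandas DataFrame either way
```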
8 | -------------------------------------------------------------------------------- /Step08/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step08/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step08/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class SqlLoader: 40 | def __init__(self, connection_string): 41 | engine = create_engine(connection_string) 42 | self.connection = engine.connect() 43 | 44 | def get_passengers(self): 45 | query = 'SELECT * FROM tbl_passengers' 46 | return pd.read_sql(query, con=self.connection) 47 | 48 | def get_targets(self): 49 | query = 'SELECT * FROM tbl_targets' 50 | return pd.read_sql(query, con=self.connection) 51 | 52 | 53 | class TestLoader: 54 | def __init__(self, passengers_filename, targets_filename, real_loader): 55 | self.passengers_filename = passengers_filename 56 | self.targets_filename = targets_filename 57 | self.real_loader = real_loader 58 | if not os.path.isfile(self.passengers_filename): 59 | df = self.real_loader.get_passengers() 60 | df.to_pickle(self.passengers_filename) 61 | if not os.path.isfile(self.targets_filename): 62 | df = self.real_loader.get_targets() 63 | df.to_pickle(self.targets_filename) 64 | 65 | def get_passengers(self): 66 | return pd.read_pickle(self.passengers_filename) 67 | 68 | def get_targets(self): 69 | return pd.read_pickle(self.targets_filename) 70 | 71 | 72 | class TitanicModelCreator: 73 | def __init__(self, loader): 74 | self.loader = loader 75 | np.random.seed(42) 76 | 77 | def run(self): 78 | df = self.loader.get_passengers() 79 | targets = self.loader.get_targets() 80 | 81 | # parch = Parents/Children, sibsp = Siblings/Spouses 82 | df['family_size'] = df['parch'] + df['sibsp'] 83 | df['is_alone'] = [ 84 | 1 if family_size == 1 else 0 for 
family_size in df['family_size'] 85 | ] 86 | 87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 89 | df['title'] = [ 90 | 'rare' if title in rare_titles else title for title in df['title'] 91 | ] 92 | 93 | df = df[ 94 | [ 95 | 'pclass', 96 | 'sex', 97 | 'age', 98 | 'ticket', 99 | 'family_size', 100 | 'fare', 101 | 'embarked', 102 | 'is_alone', 103 | 'title', 104 | ] 105 | ] 106 | 107 | targets = [int(v) for v in targets['is_survived']] 108 | X_train, X_test, y_train, y_test = train_test_split( 109 | df, targets, stratify=targets, test_size=0.2 110 | ) 111 | 112 | X_train_categorical = X_train[ 113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 114 | ] 115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 116 | 117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 118 | X_train_categorical 119 | ) 120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 122 | 123 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 124 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 128 | 129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 130 | X_train_numerical_imputed_scaled = robust_scaler.transform( 131 | X_train_numerical_imputed 132 | ) 133 | X_test_numerical_imputed_scaled = robust_scaler.transform( 134 | X_test_numerical_imputed 135 | ) 136 | 137 | X_train_processed = np.hstack( 138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 139 | ) 140 | X_test_processed = np.hstack( 141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 142 | ) 143 | 144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 145 | y_train_estimation = model.predict(X_train_processed) 146 | y_test_estimation = model.predict(X_test_processed) 147 | 148 | cm_train = confusion_matrix(y_train, y_train_estimation) 149 | 150 | cm_test = confusion_matrix(y_test, y_test_estimation) 151 | 152 | print('cm_train', cm_train) 153 | print('cm_test', cm_test) 154 | 155 | do_test('../data/cm_test.pkl', cm_test) 156 | do_test('../data/cm_train.pkl', cm_train) 157 | do_test('../data/X_train_processed.pkl', X_train_processed) 158 | do_test('../data/X_test_processed.pkl', X_test_processed) 159 | 160 | do_pandas_test('../data/df.pkl', df) 161 | 162 | 163 | def main(param: str = 'pass'): 164 | titanic_model_creator = TitanicModelCreator( 165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 166 | ) 167 | titanic_model_creator.run() 168 | 169 | 170 | if __name__ == "__main__": 171 | typer.run(main) 172 | -------------------------------------------------------------------------------- /Step08/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step09/README08.md: -------------------------------------------------------------------------------- 1 | ### Step 08: Write the test context 2 | 3 | - Create `test_main` function 4 | - Make sure `typer` calls that in `typer.run()` 5 | - Copy the 
code from `main` to `test_main` 6 | - Replace `SqlLoader` in it with `TestLoader` 7 | 8 | From now on this is the only code that is tested. The costly connection to the DB is replaced with a file load. Also, if it is still not fast enough, an additional parameter can reduce the amount of data in the test to make the process faster. [How can a Data Scientist refactor Jupyter notebooks towards production-quality code?](https://laszlo.substack.com/p/how-can-a-data-scientist-refactor) [I appreciate this might be terse. Comment, open an issue, vote on it if you would like to have a detailed discussion on this - Laszlo] 9 | 10 | This is the essence of Clean Architecture and code reuse. All code will be used in two different contexts, test and "production", by injecting different dependencies. Because the same code runs in both places there is no time spent on translating from one to another. The test setup should reflect the production context as closely as possible, so that when a test fails or passes you can expect the same to happen in production as well. This speeds up iteration because you can freely experiment in the test context and only deploy code into "production" when you are convinced it is doing what you think it should do. But it is the same code, so deployment is effortless. 11 | -------------------------------------------------------------------------------- /Step09/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step09/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step09/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class SqlLoader: 40 | def __init__(self, connection_string): 41 | engine = create_engine(connection_string) 42 | self.connection = engine.connect() 43 | 44 | def get_passengers(self): 45 | query =
'SELECT * FROM tbl_passengers' 46 | return pd.read_sql(query, con=self.connection) 47 | 48 | def get_targets(self): 49 | query = 'SELECT * FROM tbl_targets' 50 | return pd.read_sql(query, con=self.connection) 51 | 52 | 53 | class TestLoader: 54 | def __init__(self, passengers_filename, targets_filename, real_loader): 55 | self.passengers_filename = passengers_filename 56 | self.targets_filename = targets_filename 57 | self.real_loader = real_loader 58 | if not os.path.isfile(self.passengers_filename): 59 | df = self.real_loader.get_passengers() 60 | df.to_pickle(self.passengers_filename) 61 | if not os.path.isfile(self.targets_filename): 62 | df = self.real_loader.get_targets() 63 | df.to_pickle(self.targets_filename) 64 | 65 | def get_passengers(self): 66 | return pd.read_pickle(self.passengers_filename) 67 | 68 | def get_targets(self): 69 | return pd.read_pickle(self.targets_filename) 70 | 71 | 72 | class TitanicModelCreator: 73 | def __init__(self, loader): 74 | self.loader = loader 75 | np.random.seed(42) 76 | 77 | def run(self): 78 | df = self.loader.get_passengers() 79 | targets = self.loader.get_targets() 80 | 81 | # parch = Parents/Children, sibsp = Siblings/Spouses 82 | df['family_size'] = df['parch'] + df['sibsp'] 83 | df['is_alone'] = [ 84 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 85 | ] 86 | 87 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 88 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 89 | df['title'] = [ 90 | 'rare' if title in rare_titles else title for title in df['title'] 91 | ] 92 | 93 | df = df[ 94 | [ 95 | 'pclass', 96 | 'sex', 97 | 'age', 98 | 'ticket', 99 | 'family_size', 100 | 'fare', 101 | 'embarked', 102 | 'is_alone', 103 | 'title', 104 | ] 105 | ] 106 | 107 | targets = [int(v) for v in targets['is_survived']] 108 | X_train, X_test, y_train, y_test = train_test_split( 109 | df, targets, stratify=targets, test_size=0.2 110 | ) 111 | 112 | X_train_categorical = X_train[ 113 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 114 | ] 115 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 116 | 117 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 118 | X_train_categorical 119 | ) 120 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 121 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 122 | 123 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 124 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 125 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 126 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 127 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 128 | 129 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 130 | X_train_numerical_imputed_scaled = robust_scaler.transform( 131 | X_train_numerical_imputed 132 | ) 133 | X_test_numerical_imputed_scaled = robust_scaler.transform( 134 | X_test_numerical_imputed 135 | ) 136 | 137 | X_train_processed = np.hstack( 138 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 139 | ) 140 | X_test_processed = np.hstack( 141 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 142 | ) 143 | 144 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 145 | y_train_estimation = model.predict(X_train_processed) 146 | y_test_estimation = model.predict(X_test_processed) 147 | 148 | cm_train = 
confusion_matrix(y_train, y_train_estimation) 149 | 150 | cm_test = confusion_matrix(y_test, y_test_estimation) 151 | 152 | print('cm_train', cm_train) 153 | print('cm_test', cm_test) 154 | 155 | do_test('../data/cm_test.pkl', cm_test) 156 | do_test('../data/cm_train.pkl', cm_train) 157 | do_test('../data/X_train_processed.pkl', X_train_processed) 158 | do_test('../data/X_test_processed.pkl', X_test_processed) 159 | 160 | do_pandas_test('../data/df.pkl', df) 161 | 162 | 163 | def main(param: str = 'pass'): 164 | titanic_model_creator = TitanicModelCreator( 165 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 166 | ) 167 | titanic_model_creator.run() 168 | 169 | 170 | def test_main(param: str = 'pass'): 171 | titanic_model_creator = TitanicModelCreator( 172 | loader=TestLoader( 173 | passengers_filename='../data/passengers.pkl', 174 | targets_filename='../data/targets.pkl', 175 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 176 | ) 177 | ) 178 | titanic_model_creator.run() 179 | 180 | 181 | if __name__ == "__main__": 182 | typer.run(test_main) 183 | -------------------------------------------------------------------------------- /Step09/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step10/README09.md: -------------------------------------------------------------------------------- 1 | ### Step 09: Merge passenger data with targets 2 | 3 | - Remove the `get_targets()` interface 4 | - Replace the query in `SqlLoader` 5 | - Remove any code related to `targets` 6 | 7 | This step prepares for building the "domain data model". The Titanic model is about survival of her passengers. For the code to align with this domain, the concept of "passengers" needs to be introduced (as a class/object). A passenger either survived or not; survival is an attribute of the passenger and needs to be implemented as such. 8 | 9 | This is a critical part of the code quality journey and building better systems. Once you introduce these concepts your code will depend directly on the business problem you are solving, not on the various representations in which the data is stored (pandas, numpy, csv, etc.).
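As a small sketch of the contrast (illustrative names only; the `Passenger` class itself arrives in the next step):

```python
# Representation-coupled: the code has to know that survival lives in a
# separate table keyed by pid.
is_survived = targets_df.loc[targets_df['pid'] == pid, 'is_survived'].iloc[0]

# Domain-coupled: survival is simply an attribute of a passenger.
if passenger.is_survived:
    ...
```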
I wrote about this many times on my blog: 10 | 11 | - [3 Ways Domain Data Models help Data Science Projects](https://laszlo.substack.com/p/3-ways-domain-data-models-help-data) 12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london) 13 | - [How did I change my mind about dataclasses in ML projects?](https://laszlo.substack.com/p/how-did-i-change-my-mind-about-dataclasses) 14 | -------------------------------------------------------------------------------- /Step10/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step10/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step10/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | def do_test(filename, data): 18 | if not os.path.isfile(filename): 19 | pickle.dump(data, open(filename, 'wb')) 20 | truth = pickle.load(open(filename, 'rb')) 21 | try: 22 | np.testing.assert_almost_equal(data, truth) 23 | print(f'{filename} test passed') 24 | except AssertionError as ex: 25 | print(f'{filename} test failed {ex}') 26 | 27 | 28 | def do_pandas_test(filename, data): 29 | if not os.path.isfile(filename): 30 | data.to_pickle(filename) 31 | truth = pd.read_pickle(filename) 32 | try: 33 | pd.testing.assert_frame_equal(data, truth) 34 | print(f'{filename} pandas test passed') 35 | except AssertionError as ex: 36 | print(f'{filename} pandas test failed {ex}') 37 | 38 | 39 | class SqlLoader: 40 | def __init__(self, connection_string): 41 | engine = create_engine(connection_string) 42 | self.connection = engine.connect() 43 | 44 | def get_passengers(self): 45 | query = """ 46 | SELECT 47 | tbl_passengers.*, 48 | tbl_targets.is_survived 49 | FROM 50 | tbl_passengers 51 | JOIN 52 | tbl_targets 53 | ON 54 | tbl_passengers.pid=tbl_targets.pid 55 | """ 56 | return pd.read_sql(query, con=self.connection) 57 | 58 | 59 | class TestLoader: 60 | def __init__(self, passengers_filename, real_loader): 61 | self.passengers_filename = passengers_filename 62 | self.real_loader = real_loader 63 | if not os.path.isfile(self.passengers_filename): 64 | df = self.real_loader.get_passengers() 65 | df.to_pickle(self.passengers_filename) 66 | 67 | def get_passengers(self): 68 | return pd.read_pickle(self.passengers_filename) 69 | 70 | 71 | class TitanicModelCreator: 72 | def __init__(self, loader): 73 | self.loader = loader 74 | np.random.seed(42) 75 | 76 | def run(self): 77 | df = self.loader.get_passengers() 78 | 79 
| # parch = Parents/Children, sibsp = Siblings/Spouses 80 | df['family_size'] = df['parch'] + df['sibsp'] 81 | df['is_alone'] = [ 82 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 83 | ] 84 | 85 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 86 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 87 | df['title'] = [ 88 | 'rare' if title in rare_titles else title for title in df['title'] 89 | ] 90 | 91 | targets = [int(v) for v in df['is_survived']] 92 | df = df[ 93 | [ 94 | 'pclass', 95 | 'sex', 96 | 'age', 97 | 'ticket', 98 | 'family_size', 99 | 'fare', 100 | 'embarked', 101 | 'is_alone', 102 | 'title', 103 | ] 104 | ] 105 | 106 | X_train, X_test, y_train, y_test = train_test_split( 107 | df, targets, stratify=targets, test_size=0.2 108 | ) 109 | 110 | X_train_categorical = X_train[ 111 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 112 | ] 113 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 114 | 115 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 116 | X_train_categorical 117 | ) 118 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 119 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 120 | 121 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 122 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 123 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 124 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 125 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 126 | 127 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 128 | X_train_numerical_imputed_scaled = robust_scaler.transform( 129 | X_train_numerical_imputed 130 | ) 131 | X_test_numerical_imputed_scaled = robust_scaler.transform( 132 | X_test_numerical_imputed 133 | ) 134 | 135 | X_train_processed = np.hstack( 136 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 137 | ) 138 | X_test_processed = np.hstack( 139 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 140 | ) 141 | 142 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 143 | y_train_estimation = model.predict(X_train_processed) 144 | y_test_estimation = model.predict(X_test_processed) 145 | 146 | cm_train = confusion_matrix(y_train, y_train_estimation) 147 | 148 | cm_test = confusion_matrix(y_test, y_test_estimation) 149 | 150 | print('cm_train', cm_train) 151 | print('cm_test', cm_test) 152 | 153 | do_test('../data/cm_test.pkl', cm_test) 154 | do_test('../data/cm_train.pkl', cm_train) 155 | do_test('../data/X_train_processed.pkl', X_train_processed) 156 | do_test('../data/X_test_processed.pkl', X_test_processed) 157 | 158 | do_pandas_test('../data/df.pkl', df) 159 | 160 | 161 | def main(param: str = 'pass'): 162 | titanic_model_creator = TitanicModelCreator( 163 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 164 | ) 165 | titanic_model_creator.run() 166 | 167 | 168 | def test_main(param: str = 'pass'): 169 | titanic_model_creator = TitanicModelCreator( 170 | loader=TestLoader( 171 | passengers_filename='../data/passengers_with_is_survived.pkl', 172 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 173 | ) 174 | ) 175 | titanic_model_creator.run() 176 | 177 | 178 | if __name__ == "__main__": 179 | typer.run(test_main) 180 | 
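Note that at the bottom of the file, switching between the production and test contexts still means editing the `typer.run()` call by hand. A hypothetical convenience dispatcher (not in this repo) could select the context from the command line instead:

```python
import sys

# Hypothetical (not in this repo): run `python titanic_model.py --test` for
# the test context, anything else for the production context.
if __name__ == "__main__":
    if '--test' in sys.argv:
        sys.argv.remove('--test')
        typer.run(test_main)
    else:
        typer.run(main)
```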
-------------------------------------------------------------------------------- /Step10/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step11/README10.md: -------------------------------------------------------------------------------- 1 | ### Step 10: Create Passenger class 2 | 3 | - Import `BaseModel` from `pydantic` 4 | - Create the class by inspecting: 5 | - The `dtype` of columns used in `df` 6 | - The actual values in `df` 7 | - The names of the columns that are used later in the code 8 | 9 | There is really no shortcut here. In a "real" project defining this class would be the first step, but in a legacy project you need to deal with it later. The benefit of domain data objects is that any time you use them you can assume they fulfill a set of assumptions. These can be made explicit with `pydantic's` validators. One goal of the refactoring is to make sure that most interactions between classes happen through domain data objects. This simplifies structuring the project: any future data-related change has a well-defined place to happen. 10 | -------------------------------------------------------------------------------- /Step11/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step11/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step11/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from collections import Counter 8 | from sqlalchemy import create_engine 9 | 10 | from sklearn.model_selection import train_test_split 11 | from sklearn.linear_model import LogisticRegression 12 | from sklearn.preprocessing import RobustScaler 13 | from sklearn.preprocessing import OneHotEncoder 14 | from sklearn.impute import KNNImputer 15 | from sklearn.metrics import confusion_matrix 16 | 17 | 18 | class Passenger(BaseModel): 19 | pid: int 20 | pclass: int 21 | sex: str 22 | age: float 23 | ticket: str 24 | family_size: int 25 | fare: float 26 | embarked: str 27 | is_alone: int 28 | title: str 29 | is_survived: int 30 | 31 | 32 | # targets = [int(v) for v in df['is_survived']] 33 | # df = df[[ 34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size', 35 | # 'fare', 'embarked', 'is_alone', 'title', 36 | # ]] 37 | 38 | # >>> df[:3].T 39 | # 0 1 2 40 | # pid 0 1 2 41 | # pclass 1.0 1.0 1.0 42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss.
Helen Loraine 43 | # sex female male female 44 | # age 29.0 0.9167 2.0 45 | # sibsp 0.0 1.0 1.0 46 | # parch 0.0 2.0 2.0 47 | # ticket 24160 113781 113781 48 | # fare 211.3375 151.55 151.55 49 | # cabin B5 C22 C26 C22 C26 50 | # embarked S S S 51 | # boat 2 11 None 52 | # body NaN NaN NaN 53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON 54 | # is_survived 1 1 0 55 | # >>> df.dtypes 56 | # pid int64 57 | # pclass float64 58 | # name object 59 | # sex object 60 | # age float64 61 | # sibsp float64 62 | # parch float64 63 | # ticket object 64 | # fare float64 65 | # cabin object 66 | # embarked object 67 | # boat object 68 | # body float64 69 | # home.dest object 70 | # is_survived int64 71 | # >>> set(df['pclass']) 72 | # {1.0, 2.0, 3.0} 73 | 74 | 75 | def do_test(filename, data): 76 | if not os.path.isfile(filename): 77 | pickle.dump(data, open(filename, 'wb')) 78 | truth = pickle.load(open(filename, 'rb')) 79 | try: 80 | np.testing.assert_almost_equal(data, truth) 81 | print(f'{filename} test passed') 82 | except AssertionError as ex: 83 | print(f'{filename} test failed {ex}') 84 | 85 | 86 | def do_pandas_test(filename, data): 87 | if not os.path.isfile(filename): 88 | data.to_pickle(filename) 89 | truth = pd.read_pickle(filename) 90 | try: 91 | pd.testing.assert_frame_equal(data, truth) 92 | print(f'{filename} pandas test passed') 93 | except AssertionError as ex: 94 | print(f'{filename} pandas test failed {ex}') 95 | 96 | 97 | class SqlLoader: 98 | def __init__(self, connection_string): 99 | engine = create_engine(connection_string) 100 | self.connection = engine.connect() 101 | 102 | def get_passengers(self): 103 | query = """ 104 | SELECT 105 | tbl_passengers.*, 106 | tbl_targets.is_survived 107 | FROM 108 | tbl_passengers 109 | JOIN 110 | tbl_targets 111 | ON 112 | tbl_passengers.pid=tbl_targets.pid 113 | """ 114 | return pd.read_sql(query, con=self.connection) 115 | 116 | 117 | class TestLoader: 118 | def __init__(self, passengers_filename, real_loader): 119 | self.passengers_filename = passengers_filename 120 | self.real_loader = real_loader 121 | if not os.path.isfile(self.passengers_filename): 122 | df = self.real_loader.get_passengers() 123 | df.to_pickle(self.passengers_filename) 124 | 125 | def get_passengers(self): 126 | return pd.read_pickle(self.passengers_filename) 127 | 128 | 129 | class TitanicModelCreator: 130 | def __init__(self, loader): 131 | self.loader = loader 132 | np.random.seed(42) 133 | 134 | def run(self): 135 | df = self.loader.get_passengers() 136 | 137 | # parch = Parents/Children, sibsp = Siblings/Spouses 138 | df['family_size'] = df['parch'] + df['sibsp'] 139 | df['is_alone'] = [ 140 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 141 | ] 142 | 143 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 144 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 145 | df['title'] = [ 146 | 'rare' if title in rare_titles else title for title in df['title'] 147 | ] 148 | 149 | targets = [int(v) for v in df['is_survived']] 150 | df = df[ 151 | [ 152 | 'pclass', 153 | 'sex', 154 | 'age', 155 | 'ticket', 156 | 'family_size', 157 | 'fare', 158 | 'embarked', 159 | 'is_alone', 160 | 'title', 161 | ] 162 | ] 163 | 164 | X_train, X_test, y_train, y_test = train_test_split( 165 | df, targets, stratify=targets, test_size=0.2 166 | ) 167 | 168 | X_train_categorical = X_train[ 169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 170 | ] 171 | X_test_categorical 
= X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 172 | 173 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 174 | X_train_categorical 175 | ) 176 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 177 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 178 | 179 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 180 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 181 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 182 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 183 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 184 | 185 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 186 | X_train_numerical_imputed_scaled = robust_scaler.transform( 187 | X_train_numerical_imputed 188 | ) 189 | X_test_numerical_imputed_scaled = robust_scaler.transform( 190 | X_test_numerical_imputed 191 | ) 192 | 193 | X_train_processed = np.hstack( 194 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 195 | ) 196 | X_test_processed = np.hstack( 197 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 198 | ) 199 | 200 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 201 | y_train_estimation = model.predict(X_train_processed) 202 | y_test_estimation = model.predict(X_test_processed) 203 | 204 | cm_train = confusion_matrix(y_train, y_train_estimation) 205 | 206 | cm_test = confusion_matrix(y_test, y_test_estimation) 207 | 208 | print('cm_train', cm_train) 209 | print('cm_test', cm_test) 210 | 211 | do_test('../data/cm_test.pkl', cm_test) 212 | do_test('../data/cm_train.pkl', cm_train) 213 | do_test('../data/X_train_processed.pkl', X_train_processed) 214 | do_test('../data/X_test_processed.pkl', X_test_processed) 215 | 216 | do_pandas_test('../data/df.pkl', df) 217 | 218 | 219 | def main(param: str = 'pass'): 220 | titanic_model_creator = TitanicModelCreator( 221 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 222 | ) 223 | titanic_model_creator.run() 224 | 225 | 226 | def test_main(param: str = 'pass'): 227 | titanic_model_creator = TitanicModelCreator( 228 | loader=TestLoader( 229 | passengers_filename='../data/passengers_with_is_survived.pkl', 230 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 231 | ) 232 | ) 233 | titanic_model_creator.run() 234 | 235 | 236 | if __name__ == "__main__": 237 | typer.run(test_main) 238 | -------------------------------------------------------------------------------- /Step11/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step12/README11.md: -------------------------------------------------------------------------------- 1 | ### Step 11: Create domain data object based data loader 2 | 3 | - Create `PassengerLoader` class that takes a "real"/"old" loader 4 | - In its `get_passengers` function, load the data from the loader and create the `Passenger` objects 5 | - Copy the data transformations from `TitanicModelCreator.run()` 6 | 7 | Take a look at how the `rare_titles` variable is used in `run()`. After scanning the entire dataset for titles, the ones that appear less than 10 times are selected. This can be done only if you have access to the entire database and this list needs to be maintained. 
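The scan in question is this single comprehension in `run()` (copied from the step's code); it needs every row in memory before it can label any of them, and Step 14 will replace it with an explicit constant:

```python
from collections import Counter

# Global scan: the whole dataset must be available up front.
rare_titles = {k for k, v in Counter(df['title']).items() if v < 10}

# The explicit, maintained constant that Step 14 introduces instead:
RARE_TITLES = {'Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
               'Major', 'Mlle', 'Mme', 'Ms', 'Rev', 'Sir', 'the Countess'}
```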
This can cause problems in a real setting where the above operation is too difficult to do, for example if you have millions of items or a constant stream. This kind of dependency is common in legacy code, and one of the goals of refactoring is to identify these dependencies and make them explicit. Here we will use a constant, but in a productionised environment this might need a whole separate service. 8 | 9 | `PassengerLoader` implements the Factory Design Pattern. Factories are classes that create other classes; they are a type of adapter that hides away where the data is coming from and how it is stored, returning only abstract, domain-relevant classes that you can use downstream. Factories are one of two (later increased to three) fundamentally relevant Design Patterns for Data Science workflows: 10 | 11 | - [You only need 2 Design Patterns to improve the quality of your code in a data science project](https://laszlo.substack.com/p/you-only-need-2-design-patterns-to) 12 | - [Clean Architecture: How to structure your ML projects to reduce technical debt](https://laszlo.substack.com/p/slides-for-my-talk-at-pydata-london) 13 | -------------------------------------------------------------------------------- /Step12/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step12/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step12/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from collections import Counter 8 | from sqlalchemy import create_engine 9 | 10 | from sklearn.model_selection import train_test_split 11 | from sklearn.linear_model import LogisticRegression 12 | from sklearn.preprocessing import RobustScaler 13 | from sklearn.preprocessing import OneHotEncoder 14 | from sklearn.impute import KNNImputer 15 | from sklearn.metrics import confusion_matrix 16 | 17 | 18 | class Passenger(BaseModel): 19 | pid: int 20 | pclass: int 21 | sex: str 22 | age: float 23 | ticket: str 24 | family_size: int 25 | fare: float 26 | embarked: str 27 | is_alone: int 28 | title: str 29 | is_survived: int 30 | 31 | 32 | # targets = [int(v) for v in df['is_survived']] 33 | # df = df[[ 34 | # 'pclass', 'sex', 'age', 'ticket', 'family_size', 35 | # 'fare', 'embarked', 'is_alone', 'title', 36 | # ]] 37 | 38 | # >>> df[:3].T 39 | # 0 1 2 40 | # pid 0 1 2 41 | # pclass 1.0 1.0 1.0 42 | # name Allen, Miss. Elisabeth Walton Allison, Master. Hudson Trevor Allison, Miss.
Helen Loraine 43 | # sex female male female 44 | # age 29.0 0.9167 2.0 45 | # sibsp 0.0 1.0 1.0 46 | # parch 0.0 2.0 2.0 47 | # ticket 24160 113781 113781 48 | # fare 211.3375 151.55 151.55 49 | # cabin B5 C22 C26 C22 C26 50 | # embarked S S S 51 | # boat 2 11 None 52 | # body NaN NaN NaN 53 | # home.dest St Louis, MO Montreal, PQ / Chesterville, ON Montreal, PQ / Chesterville, ON 54 | # is_survived 1 1 0 55 | # >>> df.dtypes 56 | # pid int64 57 | # pclass float64 58 | # name object 59 | # sex object 60 | # age float64 61 | # sibsp float64 62 | # parch float64 63 | # ticket object 64 | # fare float64 65 | # cabin object 66 | # embarked object 67 | # boat object 68 | # body float64 69 | # home.dest object 70 | # is_survived int64 71 | # >>> set(df['pclass']) 72 | # {1.0, 2.0, 3.0} 73 | 74 | 75 | def do_test(filename, data): 76 | if not os.path.isfile(filename): 77 | pickle.dump(data, open(filename, 'wb')) 78 | truth = pickle.load(open(filename, 'rb')) 79 | try: 80 | np.testing.assert_almost_equal(data, truth) 81 | print(f'{filename} test passed') 82 | except AssertionError as ex: 83 | print(f'{filename} test failed {ex}') 84 | 85 | 86 | def do_pandas_test(filename, data): 87 | if not os.path.isfile(filename): 88 | data.to_pickle(filename) 89 | truth = pd.read_pickle(filename) 90 | try: 91 | pd.testing.assert_frame_equal(data, truth) 92 | print(f'{filename} pandas test passed') 93 | except AssertionError as ex: 94 | print(f'{filename} pandas test failed {ex}') 95 | 96 | 97 | class SqlLoader: 98 | def __init__(self, connection_string): 99 | engine = create_engine(connection_string) 100 | self.connection = engine.connect() 101 | 102 | def get_passengers(self): 103 | query = """ 104 | SELECT 105 | tbl_passengers.*, 106 | tbl_targets.is_survived 107 | FROM 108 | tbl_passengers 109 | JOIN 110 | tbl_targets 111 | ON 112 | tbl_passengers.pid=tbl_targets.pid 113 | """ 114 | return pd.read_sql(query, con=self.connection) 115 | 116 | 117 | class TestLoader: 118 | def __init__(self, passengers_filename, real_loader): 119 | self.passengers_filename = passengers_filename 120 | self.real_loader = real_loader 121 | if not os.path.isfile(self.passengers_filename): 122 | df = self.real_loader.get_passengers() 123 | df.to_pickle(self.passengers_filename) 124 | 125 | def get_passengers(self): 126 | return pd.read_pickle(self.passengers_filename) 127 | 128 | 129 | class PassengerLoader: 130 | def __init__(self, loader, rare_titles=None): 131 | self.loader = loader 132 | self.rare_titles = rare_titles 133 | 134 | def get_passengers(self): 135 | passengers = [] 136 | for data in self.loader.get_passengers().itertuples(): 137 | # parch = Parents/Children, sibsp = Siblings/Spouses 138 | family_size = int(data.parch + data.sibsp) 139 | # Allen, Miss. 
Elisabeth Walton 140 | title = data.name.split(',')[1].split('.')[0].strip() 141 | passenger = Passenger( 142 | pid=int(data.pid), 143 | pclass=int(data.pclass), 144 | sex=str(data.sex), 145 | age=float(data.age), 146 | ticket=str(data.ticket), 147 | family_size=family_size, 148 | fare=float(data.fare), 149 | embarked=str(data.embarked), 150 | is_alone=1 if family_size == 1 else 0, 151 | title='rare' if title in self.rare_titles else title, 152 | is_survived=int(data.is_survived), 153 | ) 154 | passengers.append(passenger) 155 | return passengers 156 | 157 | 158 | # Not used: 159 | # cabin object 160 | # boat object 161 | # body float64 162 | # home.dest object 163 | 164 | 165 | class TitanicModelCreator: 166 | def __init__(self, loader): 167 | self.loader = loader 168 | np.random.seed(42) 169 | 170 | def run(self): 171 | df = self.loader.get_passengers() 172 | 173 | # parch = Parents/Children, sibsp = Siblings/Spouses 174 | df['family_size'] = df['parch'] + df['sibsp'] 175 | df['is_alone'] = [ 176 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 177 | ] 178 | 179 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 180 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 181 | df['title'] = [ 182 | 'rare' if title in rare_titles else title for title in df['title'] 183 | ] 184 | 185 | targets = [int(v) for v in df['is_survived']] 186 | df = df[ 187 | [ 188 | 'pclass', 189 | 'sex', 190 | 'age', 191 | 'ticket', 192 | 'family_size', 193 | 'fare', 194 | 'embarked', 195 | 'is_alone', 196 | 'title', 197 | ] 198 | ] 199 | 200 | X_train, X_test, y_train, y_test = train_test_split( 201 | df, targets, stratify=targets, test_size=0.2 202 | ) 203 | 204 | X_train_categorical = X_train[ 205 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 206 | ] 207 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 208 | 209 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 210 | X_train_categorical 211 | ) 212 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 213 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 214 | 215 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 216 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 217 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 218 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 219 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 220 | 221 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 222 | X_train_numerical_imputed_scaled = robust_scaler.transform( 223 | X_train_numerical_imputed 224 | ) 225 | X_test_numerical_imputed_scaled = robust_scaler.transform( 226 | X_test_numerical_imputed 227 | ) 228 | 229 | X_train_processed = np.hstack( 230 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 231 | ) 232 | X_test_processed = np.hstack( 233 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 234 | ) 235 | 236 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 237 | y_train_estimation = model.predict(X_train_processed) 238 | y_test_estimation = model.predict(X_test_processed) 239 | 240 | cm_train = confusion_matrix(y_train, y_train_estimation) 241 | 242 | cm_test = confusion_matrix(y_test, y_test_estimation) 243 | 244 | print('cm_train', cm_train) 245 | print('cm_test', cm_test) 246 | 247 | do_test('../data/cm_test.pkl', cm_test) 248 | 
do_test('../data/cm_train.pkl', cm_train) 249 | do_test('../data/X_train_processed.pkl', X_train_processed) 250 | do_test('../data/X_test_processed.pkl', X_test_processed) 251 | 252 | do_pandas_test('../data/df.pkl', df) 253 | 254 | 255 | def main(param: str = 'pass'): 256 | titanic_model_creator = TitanicModelCreator( 257 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 258 | ) 259 | titanic_model_creator.run() 260 | 261 | 262 | def test_main(param: str = 'pass'): 263 | titanic_model_creator = TitanicModelCreator( 264 | loader=TestLoader( 265 | passengers_filename='../data/passengers_with_is_survived.pkl', 266 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 267 | ) 268 | ) 269 | titanic_model_creator.run() 270 | 271 | 272 | if __name__ == "__main__": 273 | typer.run(test_main) 274 | -------------------------------------------------------------------------------- /Step12/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step13/README12.md: -------------------------------------------------------------------------------- 1 | ### Step 12: Remove any data that is not explicitly needed 2 | 3 | - Update the query in `SqlLoader` to only retrieve the columns that will be used for the model's input 4 | 5 | Simplifying down to the minimum is a goal of refactoring. Anything that is not explicitly needed should be removed. If the requirements change they can be added back again. For example the `ticket` column is in `df` but it is never used again in the program. Remove it. 6 | -------------------------------------------------------------------------------- /Step13/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step13/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step13/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from collections import Counter 8 | from sqlalchemy import create_engine 9 | 10 | from sklearn.model_selection import train_test_split 11 | from sklearn.linear_model import LogisticRegression 12 | from sklearn.preprocessing import RobustScaler 13 | from sklearn.preprocessing import OneHotEncoder 14 | from sklearn.impute import KNNImputer 15 | from sklearn.metrics import confusion_matrix 16 | 17 | 18 | class Passenger(BaseModel): 19 | pid: int 20 | pclass: int 21 | sex: str 22 | age: float 23 | family_size: int 24 | fare: float 25 | embarked: str 26 | is_alone: int 27 | title: str 28 | is_survived: int 29 | 30 | 31 | def do_test(filename, data): 32 | if not os.path.isfile(filename): 33 | pickle.dump(data, open(filename, 'wb')) 34 | truth = pickle.load(open(filename, 'rb')) 35 | try: 36 | np.testing.assert_almost_equal(data, truth) 37 | print(f'{filename} 
test passed') 38 | except AssertionError as ex: 39 | print(f'{filename} test failed {ex}') 40 | 41 | 42 | def do_pandas_test(filename, data): 43 | if not os.path.isfile(filename): 44 | data.to_pickle(filename) 45 | truth = pd.read_pickle(filename) 46 | try: 47 | pd.testing.assert_frame_equal(data, truth) 48 | print(f'{filename} pandas test passed') 49 | except AssertionError as ex: 50 | print(f'{filename} pandas test failed {ex}') 51 | 52 | 53 | class SqlLoader: 54 | def __init__(self, connection_string): 55 | engine = create_engine(connection_string) 56 | self.connection = engine.connect() 57 | 58 | def get_passengers(self): 59 | query = """ 60 | SELECT 61 | tbl_passengers.pid, 62 | tbl_passengers.pclass, 63 | tbl_passengers.sex, 64 | tbl_passengers.age, 65 | tbl_passengers.parch, 66 | tbl_passengers.sibsp, 67 | tbl_passengers.fare, 68 | tbl_passengers.embarked, 69 | tbl_passengers.name, 70 | tbl_targets.is_survived 71 | FROM 72 | tbl_passengers 73 | JOIN 74 | tbl_targets 75 | ON 76 | tbl_passengers.pid=tbl_targets.pid 77 | """ 78 | return pd.read_sql(query, con=self.connection) 79 | 80 | 81 | class TestLoader: 82 | def __init__(self, passengers_filename, real_loader): 83 | self.passengers_filename = passengers_filename 84 | self.real_loader = real_loader 85 | if not os.path.isfile(self.passengers_filename): 86 | df = self.real_loader.get_passengers() 87 | df.to_pickle(self.passengers_filename) 88 | 89 | def get_passengers(self): 90 | return pd.read_pickle(self.passengers_filename) 91 | 92 | 93 | class PassengerLoader: 94 | def __init__(self, loader, rare_titles=None): 95 | self.loader = loader 96 | self.rare_titles = rare_titles 97 | 98 | def get_passengers(self): 99 | passengers = [] 100 | for data in self.loader.get_passengers().itertuples(): 101 | # parch = Parents/Children, sibsp = Siblings/Spouses 102 | family_size = int(data.parch + data.sibsp) 103 | # Allen, Miss. 
Elisabeth Walton 104 | title = data.name.split(',')[1].split('.')[0].strip() 105 | passenger = Passenger( 106 | pid=int(data.pid), 107 | pclass=int(data.pclass), 108 | sex=str(data.sex), 109 | age=float(data.age), 110 | family_size=family_size, 111 | fare=float(data.fare), 112 | embarked=str(data.embarked), 113 | is_alone=1 if family_size == 1 else 0, 114 | title='rare' if title in self.rare_titles else title, 115 | is_survived=int(data.is_survived), 116 | ) 117 | passengers.append(passenger) 118 | return passengers 119 | 120 | 121 | class TitanicModelCreator: 122 | def __init__(self, loader): 123 | self.loader = loader 124 | np.random.seed(42) 125 | 126 | def run(self): 127 | df = self.loader.get_passengers() 128 | 129 | # parch = Parents/Children, sibsp = Siblings/Spouses 130 | df['family_size'] = df['parch'] + df['sibsp'] 131 | df['is_alone'] = [ 132 | 1 if family_size == 1 else 0 for family_size in df['family_size'] 133 | ] 134 | 135 | df['title'] = [name.split(',')[1].split('.')[0].strip() for name in df['name']] 136 | rare_titles = {k for k, v in Counter(df['title']).items() if v < 10} 137 | df['title'] = [ 138 | 'rare' if title in rare_titles else title for title in df['title'] 139 | ] 140 | 141 | targets = [int(v) for v in df['is_survived']] 142 | df = df[ 143 | [ 144 | 'pclass', 145 | 'sex', 146 | 'age', 147 | 'family_size', 148 | 'fare', 149 | 'embarked', 150 | 'is_alone', 151 | 'title', 152 | ] 153 | ] 154 | 155 | X_train, X_test, y_train, y_test = train_test_split( 156 | df, targets, stratify=targets, test_size=0.2 157 | ) 158 | 159 | X_train_categorical = X_train[ 160 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 161 | ] 162 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 163 | 164 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 165 | X_train_categorical 166 | ) 167 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 168 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 169 | 170 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 171 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 172 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 173 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 174 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 175 | 176 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 177 | X_train_numerical_imputed_scaled = robust_scaler.transform( 178 | X_train_numerical_imputed 179 | ) 180 | X_test_numerical_imputed_scaled = robust_scaler.transform( 181 | X_test_numerical_imputed 182 | ) 183 | 184 | X_train_processed = np.hstack( 185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 186 | ) 187 | X_test_processed = np.hstack( 188 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 189 | ) 190 | 191 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 192 | y_train_estimation = model.predict(X_train_processed) 193 | y_test_estimation = model.predict(X_test_processed) 194 | 195 | cm_train = confusion_matrix(y_train, y_train_estimation) 196 | 197 | cm_test = confusion_matrix(y_test, y_test_estimation) 198 | 199 | print('cm_train', cm_train) 200 | print('cm_test', cm_test) 201 | 202 | do_test('../data/cm_test.pkl', cm_test) 203 | do_test('../data/cm_train.pkl', cm_train) 204 | do_test('../data/X_train_processed.pkl', X_train_processed) 205 | do_test('../data/X_test_processed.pkl', X_test_processed) 
206 | 207 | do_pandas_test('../data/df.pkl', df) 208 | 209 | 210 | def main(param: str = 'pass'): 211 | titanic_model_creator = TitanicModelCreator( 212 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db') 213 | ) 214 | titanic_model_creator.run() 215 | 216 | 217 | def test_main(param: str = 'pass'): 218 | titanic_model_creator = TitanicModelCreator( 219 | loader=TestLoader( 220 | passengers_filename='../data/passengers_with_is_survived.pkl', 221 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 222 | ) 223 | ) 224 | titanic_model_creator.run() 225 | 226 | 227 | if __name__ == "__main__": 228 | typer.run(test_main) 229 | -------------------------------------------------------------------------------- /Step13/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step14/README13.md: -------------------------------------------------------------------------------- 1 | ### Step 13: Use Passenger objects in the program 2 | 3 | - Add `PassengerLoader` to `main` and `test_main` 4 | - Add the `RARE_TITLES` constant 5 | - Convert the classes back into the `df` dataframe with `passenger.dict()` 6 | 7 | It is very important to do refactoring incrementally. Any change should be small enough that if the tests fail the source can be found quickly. So for now we stop at using the new loader but do not change anything else. 8 | -------------------------------------------------------------------------------- /Step14/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step14/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step14/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | 
print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. 
Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModelCreator: 139 | def __init__(self, loader): 140 | self.loader = loader 141 | np.random.seed(42) 142 | 143 | def run(self): 144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()]) 145 | targets = [int(v) for v in df['is_survived']] 146 | 147 | X_train, X_test, y_train, y_test = train_test_split( 148 | df, targets, stratify=targets, test_size=0.2 149 | ) 150 | 151 | X_train_categorical = X_train[ 152 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 153 | ] 154 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 155 | 156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 157 | X_train_categorical 158 | ) 159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 160 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 161 | 162 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 163 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 164 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 165 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 166 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 167 | 168 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 169 | X_train_numerical_imputed_scaled = robust_scaler.transform( 170 | X_train_numerical_imputed 171 | ) 172 | X_test_numerical_imputed_scaled = robust_scaler.transform( 173 | X_test_numerical_imputed 174 | ) 175 | 176 | X_train_processed = np.hstack( 177 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 178 | ) 179 | X_test_processed = np.hstack( 180 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 181 | ) 182 | 183 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 184 | y_train_estimation = model.predict(X_train_processed) 185 | y_test_estimation = model.predict(X_test_processed) 186 | 187 | cm_train = confusion_matrix(y_train, y_train_estimation) 188 | 189 | cm_test = confusion_matrix(y_test, y_test_estimation) 190 | 191 | print('cm_train', cm_train) 192 | print('cm_test', cm_test) 193 | 194 | do_test('../data/cm_test.pkl', cm_test) 195 | do_test('../data/cm_train.pkl', cm_train) 196 | do_test('../data/X_train_processed.pkl', X_train_processed) 197 | do_test('../data/X_test_processed.pkl', X_test_processed) 198 | 199 | do_pandas_test('../data/df_no_tickets.pkl', df) 200 | 201 | 202 | def main(param: str = 'pass'): 203 | titanic_model_creator = TitanicModelCreator( 204 | loader=PassengerLoader( 205 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 206 | rare_titles=RARE_TITLES, 207 | ) 208 | ) 209 | titanic_model_creator.run() 210 | 211 | 212 | def test_main(param: str = 'pass'): 213 | titanic_model_creator = TitanicModelCreator( 214 | loader=PassengerLoader( 215 | loader=TestLoader( 216 | passengers_filename='../data/passengers_with_is_survived.pkl', 217 | 
real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 218 | ), 219 | rare_titles=RARE_TITLES, 220 | ) 221 | ) 222 | titanic_model_creator.run() 223 | 224 | 225 | if __name__ == "__main__": 226 | typer.run(test_main) 227 | -------------------------------------------------------------------------------- /Step14/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step15/README14.md: -------------------------------------------------------------------------------- 1 | ### Step 14: Separate training and evaluation functions 2 | 3 | - Move all code related to evaluation (variables that have `_test_` in their names) into one group 4 | 5 | After the model is created it is first trained, then evaluated on the training data, then evaluated on the testing data. These concerns should be separated from each other into their own logical places. This prepares for moving them into actually separate places. 6 | -------------------------------------------------------------------------------- /Step15/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step15/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step15/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed
{ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModelCreator: 139 | def __init__(self, loader): 140 | self.loader = loader 141 | np.random.seed(42) 142 | 143 | def run(self): 144 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()]) 145 | targets = [int(v) for v in df['is_survived']] 146 | 147 | X_train, X_test, y_train, y_test = train_test_split( 148 | df, targets, stratify=targets, test_size=0.2 149 | ) 150 | 151 | # --- TRAINING --- 152 | X_train_categorical = X_train[ 153 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 154 | ] 155 | 156 | one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False).fit( 157 | X_train_categorical 158 | ) 159 | X_train_categorical_one_hot = one_hot_encoder.transform(X_train_categorical) 160 | 161 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 162 | knn_imputer = KNNImputer(n_neighbors=5).fit(X_train_numerical) 163 | X_train_numerical_imputed = knn_imputer.transform(X_train_numerical) 164 | 165 | robust_scaler = RobustScaler().fit(X_train_numerical_imputed) 166 | X_train_numerical_imputed_scaled = robust_scaler.transform( 167 | X_train_numerical_imputed 168 | ) 169 | 170 | X_train_processed = np.hstack( 171 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 172 | ) 173 | 174 | model = LogisticRegression(random_state=0).fit(X_train_processed, y_train) 175 | y_train_estimation = model.predict(X_train_processed) 176 | 177 | cm_train = confusion_matrix(y_train, y_train_estimation) 178 | 179 | # --- TESTING --- 180 | X_test_categorical = 
X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 181 | X_test_categorical_one_hot = one_hot_encoder.transform(X_test_categorical) 182 | 183 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 184 | X_test_numerical_imputed = knn_imputer.transform(X_test_numerical) 185 | X_test_numerical_imputed_scaled = robust_scaler.transform( 186 | X_test_numerical_imputed 187 | ) 188 | 189 | X_test_processed = np.hstack( 190 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 191 | ) 192 | 193 | y_test_estimation = model.predict(X_test_processed) 194 | cm_test = confusion_matrix(y_test, y_test_estimation) 195 | 196 | print('cm_train', cm_train) 197 | print('cm_test', cm_test) 198 | 199 | do_test('../data/cm_test.pkl', cm_test) 200 | do_test('../data/cm_train.pkl', cm_train) 201 | do_test('../data/X_train_processed.pkl', X_train_processed) 202 | do_test('../data/X_test_processed.pkl', X_test_processed) 203 | 204 | do_pandas_test('../data/df_no_tickets.pkl', df) 205 | 206 | 207 | def main(param: str = 'pass'): 208 | titanic_model_creator = TitanicModelCreator( 209 | loader=PassengerLoader( 210 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 211 | rare_titles=RARE_TITLES, 212 | ) 213 | ) 214 | titanic_model_creator.run() 215 | 216 | 217 | def test_main(param: str = 'pass'): 218 | titanic_model_creator = TitanicModelCreator( 219 | loader=PassengerLoader( 220 | loader=TestLoader( 221 | passengers_filename='../data/passengers_with_is_survived.pkl', 222 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 223 | ), 224 | rare_titles=RARE_TITLES, 225 | ) 226 | ) 227 | titanic_model_creator.run() 228 | 229 | 230 | if __name__ == "__main__": 231 | typer.run(test_main) 232 | -------------------------------------------------------------------------------- /Step15/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step16/README15.md: -------------------------------------------------------------------------------- 1 | ### Step 15: Create `TitanicModel` class 2 | 3 | - Create a class that has all the `sklearn` components as member variables 4 | - Instantiate these before the "Training" block 5 | - Use these instead of the local ones 6 | 7 | The goal of the whole program is to create a model, yet until now there has been no single object describing that model. The next step is to establish the concept of this model and what services it provides to `TitanicModelCreator`.
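As a preview of what this step builds (the full `Step16/titanic_model.py` follows below), the class simply bundles the four `sklearn` components as member variables so that all fitted state lives on one object; a minimal sketch mirroring the step's own code:

```python
# Sketch: the model object introduced in this step; fitted state will move
# from run()'s local variables onto these members in later steps.
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, RobustScaler


class TitanicModel:
    def __init__(self):
        self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        self.knn_imputer = KNNImputer(n_neighbors=5)
        self.robust_scaler = RobustScaler()
        self.predictor = LogisticRegression(random_state=0)
```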
8 | -------------------------------------------------------------------------------- /Step16/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step16/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step16/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 
109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModel: 139 | def __init__(self): 140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 141 | self.knn_imputer = KNNImputer(n_neighbors=5) 142 | self.robust_scaler = RobustScaler() 143 | self.predictor = LogisticRegression(random_state=0) 144 | 145 | def train(self): 146 | pass 147 | 148 | def estimate(self, passengers): 149 | return 1 150 | 151 | 152 | class TitanicModelCreator: 153 | def __init__(self, loader): 154 | self.loader = loader 155 | np.random.seed(42) 156 | 157 | def run(self): 158 | df = pd.DataFrame([v.dict() for v in self.loader.get_passengers()]) 159 | targets = [int(v) for v in df['is_survived']] 160 | 161 | X_train, X_test, y_train, y_test = train_test_split( 162 | df, targets, stratify=targets, test_size=0.2 163 | ) 164 | 165 | # --- TRAINING --- 166 | model = TitanicModel() 167 | 168 | X_train_categorical = X_train[ 169 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 170 | ] 171 | 172 | model.one_hot_encoder.fit(X_train_categorical) 173 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical) 174 | 175 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 176 | model.knn_imputer.fit(X_train_numerical) 177 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical) 178 | 179 | model.robust_scaler.fit(X_train_numerical_imputed) 180 | X_train_numerical_imputed_scaled = model.robust_scaler.transform( 181 | X_train_numerical_imputed 182 | ) 183 | 184 | X_train_processed = np.hstack( 185 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 186 | ) 187 | 188 | model.predictor.fit(X_train_processed, y_train) 189 | y_train_estimation = model.predictor.predict(X_train_processed) 190 | 191 | cm_train = confusion_matrix(y_train, y_train_estimation) 192 | 193 | # --- TESTING --- 194 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 195 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical) 196 | 197 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 198 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical) 199 | X_test_numerical_imputed_scaled = model.robust_scaler.transform( 200 | X_test_numerical_imputed 201 | ) 202 | 203 | X_test_processed = np.hstack( 204 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 205 | ) 206 | 207 | y_test_estimation = model.predictor.predict(X_test_processed) 208 | cm_test = confusion_matrix(y_test, y_test_estimation) 209 | 210 | print('cm_train', cm_train) 211 | print('cm_test', cm_test) 212 | 213 | 
do_test('../data/cm_test.pkl', cm_test) 214 | do_test('../data/cm_train.pkl', cm_train) 215 | do_test('../data/X_train_processed.pkl', X_train_processed) 216 | do_test('../data/X_test_processed.pkl', X_test_processed) 217 | 218 | do_pandas_test('../data/df_no_tickets.pkl', df) 219 | 220 | 221 | def main(param: str = 'pass'): 222 | titanic_model_creator = TitanicModelCreator( 223 | loader=PassengerLoader( 224 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 225 | rare_titles=RARE_TITLES, 226 | ) 227 | ) 228 | titanic_model_creator.run() 229 | 230 | 231 | def test_main(param: str = 'pass'): 232 | titanic_model_creator = TitanicModelCreator( 233 | loader=PassengerLoader( 234 | loader=TestLoader( 235 | passengers_filename='../data/passengers_with_is_survived.pkl', 236 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 237 | ), 238 | rare_titles=RARE_TITLES, 239 | ) 240 | ) 241 | titanic_model_creator.run() 242 | 243 | 244 | if __name__ == "__main__": 245 | typer.run(test_main) 246 | -------------------------------------------------------------------------------- /Step16/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step17/README16.md: -------------------------------------------------------------------------------- 1 | ### Step 16: Passenger-class-based training and evaluation sets 2 | 3 | - Create a function in `TitanicModelCreator` that splits the `passengers` stratified by the "targets" (namely, whether the passenger survived or not) 4 | - Refactor `X_train`/`X_test` to be created from these lists of passengers 5 | 6 | Because `train_test_split` works on lists, we extract the pids and the targets from the `Passenger` objects and create the two sets via a mapping from pids to passengers, as the sketch below shows.
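A minimal sketch of the split described above, mirroring `split_passengers` in the step's file below (shown here as a free function rather than a method): `train_test_split` stratifies on the survival targets while the pid-to-passenger mapping recovers the objects.

```python
# Sketch: stratified split over Passenger objects via their pids.
from sklearn.model_selection import train_test_split


def split_passengers(passengers, test_size=0.2):
    passengers_map = {p.pid: p for p in passengers}  # pid -> Passenger
    pids = [p.pid for p in passengers]
    targets = [p.is_survived for p in passengers]  # stratification key
    train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=test_size)
    train_passengers = [passengers_map[pid] for pid in train_pids]
    test_passengers = [passengers_map[pid] for pid in test_pids]
    return train_passengers, test_passengers
```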
7 | -------------------------------------------------------------------------------- /Step17/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step17/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step17/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 
109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModel: 139 | def __init__(self): 140 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 141 | self.knn_imputer = KNNImputer(n_neighbors=5) 142 | self.robust_scaler = RobustScaler() 143 | self.predictor = LogisticRegression(random_state=0) 144 | 145 | def train(self): 146 | pass 147 | 148 | def estimate(self, passengers): 149 | return 1 150 | 151 | 152 | class TitanicModelCreator: 153 | def __init__(self, loader): 154 | self.loader = loader 155 | np.random.seed(42) 156 | 157 | def split_passengers(self, passengers): 158 | passengers_map = {p.pid: p for p in passengers} 159 | pids = [passenger.pid for passenger in passengers] 160 | targets = [passenger.is_survived for passenger in passengers] 161 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 162 | train_passengers = [passengers_map[pid] for pid in train_pids] 163 | test_passengers = [passengers_map[pid] for pid in test_pids] 164 | return train_passengers, test_passengers 165 | 166 | def run(self): 167 | passengers = self.loader.get_passengers() 168 | train_passengers, test_passengers = self.split_passengers(passengers) 169 | 170 | X_train = pd.DataFrame([v.dict() for v in train_passengers]) 171 | y_train = [v.is_survived for v in train_passengers] 172 | X_test = pd.DataFrame([v.dict() for v in test_passengers]) 173 | y_test = [v.is_survived for v in test_passengers] 174 | 175 | # --- TRAINING --- 176 | model = TitanicModel() 177 | 178 | X_train_categorical = X_train[ 179 | ['embarked', 'sex', 'pclass', 'title', 'is_alone'] 180 | ] 181 | 182 | model.one_hot_encoder.fit(X_train_categorical) 183 | X_train_categorical_one_hot = model.one_hot_encoder.transform(X_train_categorical) 184 | 185 | X_train_numerical = X_train[['age', 'fare', 'family_size']] 186 | model.knn_imputer.fit(X_train_numerical) 187 | X_train_numerical_imputed = model.knn_imputer.transform(X_train_numerical) 188 | 189 | model.robust_scaler.fit(X_train_numerical_imputed) 190 | X_train_numerical_imputed_scaled = model.robust_scaler.transform( 191 | X_train_numerical_imputed 192 | ) 193 | 194 | X_train_processed = np.hstack( 195 | (X_train_categorical_one_hot, X_train_numerical_imputed_scaled) 196 | ) 197 | 198 | model.predictor.fit(X_train_processed, y_train) 199 | y_train_estimation = model.predictor.predict(X_train_processed) 200 | 201 | cm_train = confusion_matrix(y_train, y_train_estimation) 202 | 203 | # --- TESTING --- 204 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 205 | X_test_categorical_one_hot = 
model.one_hot_encoder.transform(X_test_categorical) 206 | 207 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 208 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical) 209 | X_test_numerical_imputed_scaled = model.robust_scaler.transform( 210 | X_test_numerical_imputed 211 | ) 212 | 213 | X_test_processed = np.hstack( 214 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 215 | ) 216 | 217 | y_test_estimation = model.predictor.predict(X_test_processed) 218 | cm_test = confusion_matrix(y_test, y_test_estimation) 219 | 220 | print('cm_train', cm_train) 221 | print('cm_test', cm_test) 222 | 223 | do_test('../data/cm_test.pkl', cm_test) 224 | do_test('../data/cm_train.pkl', cm_train) 225 | do_test('../data/X_train_processed.pkl', X_train_processed) 226 | do_test('../data/X_test_processed.pkl', X_test_processed) 227 | 228 | do_pandas_test( 229 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers]) 230 | ) 231 | 232 | 233 | def main(param: str = 'pass'): 234 | titanic_model_creator = TitanicModelCreator( 235 | loader=PassengerLoader( 236 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 237 | rare_titles=RARE_TITLES, 238 | ) 239 | ) 240 | titanic_model_creator.run() 241 | 242 | 243 | def test_main(param: str = 'pass'): 244 | titanic_model_creator = TitanicModelCreator( 245 | loader=PassengerLoader( 246 | loader=TestLoader( 247 | passengers_filename='../data/passengers_with_is_survived.pkl', 248 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 249 | ), 250 | rare_titles=RARE_TITLES, 251 | ) 252 | ) 253 | titanic_model_creator.run() 254 | 255 | 256 | if __name__ == "__main__": 257 | typer.run(test_main) 258 | -------------------------------------------------------------------------------- /Step17/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step18/README17.md: -------------------------------------------------------------------------------- 1 | ### Step 17: Create input processing for `TitanicModel` 2 | 3 | - Move code in `run()` from between instantiating `TitanicModel` and training (`model.predictor.fit`) to the `process_inputs` function of `TitanicModel`. 4 | - Introduce `self.trained` boolean 5 | - Based on `self.trained` either call the `transform` or `fit_transform` of the `sklearn` input processor functions 6 | 7 | All the input transformation code happens twice: once for the training data and once for the evaluation data, even though transforming the data is a responsibility of the model. This is a code smell called "feature envy": `TitanicModelCreator` envies functionality from `TitanicModel`. There will be several steps to resolve this. The resulting code will be a self-contained model that can be shipped independently from its creator.
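The heart of this step is a single `process_inputs` that fits the transformers on its first (training) use and only applies them afterwards; a sketch of the method as it appears in the step's file below:

```python
# Sketch: one input-processing path shared by training and evaluation.
import numpy as np
import pandas as pd


class TitanicModel:
    # ... __init__ sets the sklearn components and self.trained = False ...

    def process_inputs(self, passengers):
        data = pd.DataFrame([v.dict() for v in passengers])
        categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']]
        numerical_data = data[['age', 'fare', 'family_size']]
        if self.trained:
            # Evaluation: reuse the transformations fitted during training.
            categorical_data = self.one_hot_encoder.transform(categorical_data)
            numerical_data = self.robust_scaler.transform(
                self.knn_imputer.transform(numerical_data)
            )
        else:
            # Training: fit the transformations on the training data.
            categorical_data = self.one_hot_encoder.fit_transform(categorical_data)
            numerical_data = self.robust_scaler.fit_transform(
                self.knn_imputer.fit_transform(numerical_data)
            )
        return np.hstack((categorical_data, numerical_data))
```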
8 | 9 | -------------------------------------------------------------------------------- /Step18/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step18/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step18/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 
108 | 109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModel: 139 | def __init__(self): 140 | self.trained = False 141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 142 | self.knn_imputer = KNNImputer(n_neighbors=5) 143 | self.robust_scaler = RobustScaler() 144 | self.predictor = LogisticRegression(random_state=0) 145 | 146 | def process_inputs(self, passengers): 147 | data = pd.DataFrame([v.dict() for v in passengers]) 148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 149 | numerical_data = data[['age', 'fare', 'family_size']] 150 | if self.trained: 151 | categorical_data = self.one_hot_encoder.transform(categorical_data) 152 | numerical_data = self.robust_scaler.transform( 153 | self.knn_imputer.transform(numerical_data) 154 | ) 155 | else: 156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data) 157 | numerical_data = self.robust_scaler.fit_transform( 158 | self.knn_imputer.fit_transform(numerical_data) 159 | ) 160 | return np.hstack((categorical_data, numerical_data)) 161 | 162 | def train(self): 163 | pass 164 | 165 | def estimate(self, passengers): 166 | return 1 167 | 168 | 169 | class TitanicModelCreator: 170 | def __init__(self, loader): 171 | self.loader = loader 172 | np.random.seed(42) 173 | 174 | def split_passengers(self, passengers): 175 | passengers_map = {p.pid: p for p in passengers} 176 | pids = [passenger.pid for passenger in passengers] 177 | targets = [passenger.is_survived for passenger in passengers] 178 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 179 | train_passengers = [passengers_map[pid] for pid in train_pids] 180 | test_passengers = [passengers_map[pid] for pid in test_pids] 181 | return train_passengers, test_passengers 182 | 183 | def run(self): 184 | passengers = self.loader.get_passengers() 185 | train_passengers, test_passengers = self.split_passengers(passengers) 186 | 187 | y_train = [v.is_survived for v in train_passengers] 188 | X_test = pd.DataFrame([v.dict() for v in test_passengers]) 189 | y_test = [v.is_survived for v in test_passengers] 190 | 191 | # --- TRAINING --- 192 | model = TitanicModel() 193 | 194 | X_train_processed = model.process_inputs(train_passengers) 195 | model.predictor.fit(X_train_processed, y_train) 196 | y_train_estimation = model.predictor.predict(X_train_processed) 197 | 198 | cm_train = confusion_matrix(y_train, y_train_estimation) 199 | 200 | # --- TESTING --- 201 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 202 | X_test_categorical_one_hot = 
model.one_hot_encoder.transform(X_test_categorical) 203 | 204 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 205 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical) 206 | X_test_numerical_imputed_scaled = model.robust_scaler.transform( 207 | X_test_numerical_imputed 208 | ) 209 | 210 | X_test_processed = np.hstack( 211 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 212 | ) 213 | 214 | y_test_estimation = model.predictor.predict(X_test_processed) 215 | cm_test = confusion_matrix(y_test, y_test_estimation) 216 | 217 | print('cm_train', cm_train) 218 | print('cm_test', cm_test) 219 | 220 | do_test('../data/cm_test.pkl', cm_test) 221 | do_test('../data/cm_train.pkl', cm_train) 222 | do_test('../data/X_train_processed.pkl', X_train_processed) 223 | do_test('../data/X_test_processed.pkl', X_test_processed) 224 | 225 | do_pandas_test( 226 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers]) 227 | ) 228 | 229 | 230 | def main(param: str = 'pass'): 231 | titanic_model_creator = TitanicModelCreator( 232 | loader=PassengerLoader( 233 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 234 | rare_titles=RARE_TITLES, 235 | ) 236 | ) 237 | titanic_model_creator.run() 238 | 239 | 240 | def test_main(param: str = 'pass'): 241 | titanic_model_creator = TitanicModelCreator( 242 | loader=PassengerLoader( 243 | loader=TestLoader( 244 | passengers_filename='../data/passengers_with_is_survived.pkl', 245 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 246 | ), 247 | rare_titles=RARE_TITLES, 248 | ) 249 | ) 250 | titanic_model_creator.run() 251 | 252 | 253 | if __name__ == "__main__": 254 | typer.run(test_main) 255 | -------------------------------------------------------------------------------- /Step18/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step19/README18.md: -------------------------------------------------------------------------------- 1 | ### Step 18: Move training into `TitanicModel` 2 | 3 | - Use the same interface as `process_inputs` with `train()` 4 | - Process the data with `process_inputs` (just pass through the arguments) 5 | - Recreate the required targets with the mapping 6 | - Train the model and set the `trained` boolean to `True` 7 | -------------------------------------------------------------------------------- /Step19/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step19/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step19/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from 
sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. 
Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModel: 139 | def __init__(self): 140 | self.trained = False 141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 142 | self.knn_imputer = KNNImputer(n_neighbors=5) 143 | self.robust_scaler = RobustScaler() 144 | self.predictor = LogisticRegression(random_state=0) 145 | 146 | def process_inputs(self, passengers): 147 | data = pd.DataFrame([v.dict() for v in passengers]) 148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 149 | numerical_data = data[['age', 'fare', 'family_size']] 150 | if self.trained: 151 | categorical_data = self.one_hot_encoder.transform(categorical_data) 152 | numerical_data = self.robust_scaler.transform( 153 | self.knn_imputer.transform(numerical_data) 154 | ) 155 | else: 156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data) 157 | numerical_data = self.robust_scaler.fit_transform( 158 | self.knn_imputer.fit_transform(numerical_data) 159 | ) 160 | return np.hstack((categorical_data, numerical_data)) 161 | 162 | def train(self, passengers): 163 | targets = [v.is_survived for v in passengers] 164 | inputs = self.process_inputs(passengers) 165 | self.predictor.fit(inputs, targets) 166 | self.trained = True 167 | 168 | def estimate(self, passengers): 169 | return 1 170 | 171 | 172 | class TitanicModelCreator: 173 | def __init__(self, loader): 174 | self.loader = loader 175 | np.random.seed(42) 176 | 177 | def split_passengers(self, passengers): 178 | passengers_map = {p.pid: p for p in passengers} 179 | pids = [passenger.pid for passenger in passengers] 180 | targets = [passenger.is_survived for passenger in passengers] 181 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 182 | train_passengers = [passengers_map[pid] for pid in train_pids] 183 | test_passengers = [passengers_map[pid] for pid in test_pids] 184 | return train_passengers, test_passengers 185 | 186 | def run(self): 187 | passengers = self.loader.get_passengers() 188 | train_passengers, test_passengers = self.split_passengers(passengers) 189 | 190 | y_train = [v.is_survived for v in train_passengers] 191 | X_test = pd.DataFrame([v.dict() for v in test_passengers]) 192 | y_test = [v.is_survived for v in test_passengers] 193 | 194 | # --- TRAINING --- 195 | model = TitanicModel() 196 | model.train(train_passengers) 197 | 198 | X_train_processed = model.process_inputs(train_passengers) 199 | y_train_estimation = model.predictor.predict(X_train_processed) 200 | cm_train = confusion_matrix(y_train, y_train_estimation) 201 | 202 | # --- TESTING --- 203 | X_test_categorical = X_test[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 204 | X_test_categorical_one_hot = model.one_hot_encoder.transform(X_test_categorical) 205 | 206 | X_test_numerical = X_test[['age', 'fare', 'family_size']] 207 | X_test_numerical_imputed = model.knn_imputer.transform(X_test_numerical) 208 | X_test_numerical_imputed_scaled = model.robust_scaler.transform( 209 | 
X_test_numerical_imputed 210 | ) 211 | 212 | X_test_processed = np.hstack( 213 | (X_test_categorical_one_hot, X_test_numerical_imputed_scaled) 214 | ) 215 | 216 | y_test_estimation = model.predictor.predict(X_test_processed) 217 | cm_test = confusion_matrix(y_test, y_test_estimation) 218 | 219 | print('cm_train', cm_train) 220 | print('cm_test', cm_test) 221 | 222 | do_test('../data/cm_test.pkl', cm_test) 223 | do_test('../data/cm_train.pkl', cm_train) 224 | do_test('../data/X_train_processed.pkl', X_train_processed) 225 | do_test('../data/X_test_processed.pkl', X_test_processed) 226 | 227 | do_pandas_test( 228 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers]) 229 | ) 230 | 231 | 232 | def main(param: str = 'pass'): 233 | titanic_model_creator = TitanicModelCreator( 234 | loader=PassengerLoader( 235 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 236 | rare_titles=RARE_TITLES, 237 | ) 238 | ) 239 | titanic_model_creator.run() 240 | 241 | 242 | def test_main(param: str = 'pass'): 243 | titanic_model_creator = TitanicModelCreator( 244 | loader=PassengerLoader( 245 | loader=TestLoader( 246 | passengers_filename='../data/passengers_with_is_survived.pkl', 247 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 248 | ), 249 | rare_titles=RARE_TITLES, 250 | ) 251 | ) 252 | titanic_model_creator.run() 253 | 254 | 255 | if __name__ == "__main__": 256 | typer.run(test_main) 257 | -------------------------------------------------------------------------------- /Step19/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step20/README19.md: -------------------------------------------------------------------------------- 1 | ### Step 19: Move prediction to `TitanicModel` 2 | 3 | - Create the `estimate` function 4 | - Call `process_inputs` and `predictor.predict` in it 5 | - Remove all evaluation input processing code 6 | - Call `estimate` from `run` 7 | 8 | Because there was no separation of concerns, the input processing code was duplicated; now that it has been moved to its own location, the duplicate can be removed. 9 | 10 | `X_train_processed` and `X_test_processed` no longer exist, so to pass the tests they need to be recreated. This is a good point to think about why this is necessary and to find a different way to test behaviour. To keep the project short we set this aside, but this would be a good place to introduce more tests. A sketch of the resulting `estimate` follows below.
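With input processing already on the model, `estimate` reduces to transforming the passengers and delegating to the predictor. A minimal sketch under that assumption; the step's own `titanic_model.py` holds the authoritative version:

```python
# Sketch: prediction moves onto TitanicModel next to its input processing.
class TitanicModel:
    # ... __init__, process_inputs and train as in the previous steps ...

    def estimate(self, passengers):
        inputs = self.process_inputs(passengers)  # transform-only once trained
        return self.predictor.predict(inputs)
```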
11 | -------------------------------------------------------------------------------- /Step20/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step20/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step20/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 
| 109 | 110 | class PassengerLoader: 111 | def __init__(self, loader, rare_titles=None): 112 | self.loader = loader 113 | self.rare_titles = rare_titles 114 | 115 | def get_passengers(self): 116 | passengers = [] 117 | for data in self.loader.get_passengers().itertuples(): 118 | # parch = Parents/Children, sibsp = Siblings/Spouses 119 | family_size = int(data.parch + data.sibsp) 120 | # Allen, Miss. Elisabeth Walton 121 | title = data.name.split(',')[1].split('.')[0].strip() 122 | passenger = Passenger( 123 | pid=int(data.pid), 124 | pclass=int(data.pclass), 125 | sex=str(data.sex), 126 | age=float(data.age), 127 | family_size=family_size, 128 | fare=float(data.fare), 129 | embarked=str(data.embarked), 130 | is_alone=1 if family_size == 1 else 0, 131 | title='rare' if title in self.rare_titles else title, 132 | is_survived=int(data.is_survived), 133 | ) 134 | passengers.append(passenger) 135 | return passengers 136 | 137 | 138 | class TitanicModel: 139 | def __init__(self): 140 | self.trained = False 141 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 142 | self.knn_imputer = KNNImputer(n_neighbors=5) 143 | self.robust_scaler = RobustScaler() 144 | self.predictor = LogisticRegression(random_state=0) 145 | 146 | def process_inputs(self, passengers): 147 | data = pd.DataFrame([v.dict() for v in passengers]) 148 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 149 | numerical_data = data[['age', 'fare', 'family_size']] 150 | if self.trained: 151 | categorical_data = self.one_hot_encoder.transform(categorical_data) 152 | numerical_data = self.robust_scaler.transform( 153 | self.knn_imputer.transform(numerical_data) 154 | ) 155 | else: 156 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data) 157 | numerical_data = self.robust_scaler.fit_transform( 158 | self.knn_imputer.fit_transform(numerical_data) 159 | ) 160 | return np.hstack((categorical_data, numerical_data)) 161 | 162 | def train(self, passengers): 163 | targets = [v.is_survived for v in passengers] 164 | inputs = self.process_inputs(passengers) 165 | self.predictor.fit(inputs, targets) 166 | self.trained = True 167 | 168 | def estimate(self, passengers): 169 | inputs = self.process_inputs(passengers) 170 | return self.predictor.predict(inputs) 171 | 172 | 173 | class TitanicModelCreator: 174 | def __init__(self, loader): 175 | self.loader = loader 176 | np.random.seed(42) 177 | 178 | def split_passengers(self, passengers): 179 | passengers_map = {p.pid: p for p in passengers} 180 | pids = [passenger.pid for passenger in passengers] 181 | targets = [passenger.is_survived for passenger in passengers] 182 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 183 | train_passengers = [passengers_map[pid] for pid in train_pids] 184 | test_passengers = [passengers_map[pid] for pid in test_pids] 185 | return train_passengers, test_passengers 186 | 187 | def run(self): 188 | passengers = self.loader.get_passengers() 189 | train_passengers, test_passengers = self.split_passengers(passengers) 190 | 191 | # --- TRAINING --- 192 | model = TitanicModel() 193 | model.train(train_passengers) 194 | y_train_estimation = model.estimate(train_passengers) 195 | cm_train = confusion_matrix( 196 | [v.is_survived for v in train_passengers], y_train_estimation 197 | ) 198 | 199 | # --- TESTING --- 200 | y_test_estimation = model.estimate(test_passengers) 201 | cm_test = confusion_matrix( 202 | [v.is_survived for v in test_passengers], y_test_estimation 
203 | ) 204 | 205 | print('cm_train', cm_train) 206 | print('cm_test', cm_test) 207 | 208 | do_test('../data/cm_test.pkl', cm_test) 209 | do_test('../data/cm_train.pkl', cm_train) 210 | X_train_processed = model.process_inputs(train_passengers) 211 | do_test('../data/X_train_processed.pkl', X_train_processed) 212 | X_test_processed = model.process_inputs(test_passengers) 213 | do_test('../data/X_test_processed.pkl', X_test_processed) 214 | 215 | do_pandas_test( 216 | '../data/df_no_tickets.pkl', pd.DataFrame([v.dict() for v in passengers]) 217 | ) 218 | 219 | 220 | def main(param: str = 'pass'): 221 | titanic_model_creator = TitanicModelCreator( 222 | loader=PassengerLoader( 223 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 224 | rare_titles=RARE_TITLES, 225 | ) 226 | ) 227 | titanic_model_creator.run() 228 | 229 | 230 | def test_main(param: str = 'pass'): 231 | titanic_model_creator = TitanicModelCreator( 232 | loader=PassengerLoader( 233 | loader=TestLoader( 234 | passengers_filename='../data/passengers_with_is_survived.pkl', 235 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 236 | ), 237 | rare_titles=RARE_TITLES, 238 | ) 239 | ) 240 | titanic_model_creator.run() 241 | 242 | 243 | if __name__ == "__main__": 244 | typer.run(test_main) 245 | -------------------------------------------------------------------------------- /Step20/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step21/README20.md: -------------------------------------------------------------------------------- 1 | ### Step 20: Save model and move tests to custom model savers 2 | 3 | - Create `ModelSaver` with a `save_model` interface that accepts a model and a result object 4 | - Pickle the model and the result each to a file 5 | - Create `TestModelSaver` that has the same interface 6 | - Move the testing code to the `save_model` function 7 | - Add `model_saver` property to `TitanicModelCreator` and call it after the evaluation code 8 | - Add an instance of `ModelSaver` (in `main`) and `TestModelSaver` (in `test_main`) to the construction of `TitanicModelCreator` 9 | 10 | Currently `TitanicModelCreator` contains its own testing, even though it is intended to run in production. It also has no way to save the model. We will introduce the concept of a `ModelSaver` here: anything that needs to be preserved after model training needs to be passed to this class. 11 | 12 | We will also move testing into a dedicated `TestModelSaver` that, instead of saving the model, runs the tests that would otherwise be in `run()`. This way the same code can run in production and in testing without change.
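The saver is a drop-in dependency with a one-method interface; a minimal sketch of the production version, matching this step's `titanic_model.py` (with the model dump pointing at `model_filename`):

```python
class ModelSaver:
    def __init__(self, model_filename, result_filename):
        self.model_filename = model_filename
        self.result_filename = result_filename

    def save_model(self, model, result):
        # Persist the trained model and the evaluation results separately
        pickle.dump(model, open(self.model_filename, 'wb'))
        pickle.dump(result, open(self.result_filename, 'wb'))
```

Because `TestModelSaver` exposes the same `save_model(model, result)` signature, swapping one for the other in `main`/`test_main` changes behaviour without touching `run()`.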
13 | -------------------------------------------------------------------------------- /Step21/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step21/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step21/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 | tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 
| 109 | 110 | class ModelSaver: 111 | def __init__(self, model_filename, result_filename): 112 | self.model_filename = model_filename 113 | self.result_filename = result_filename 114 | 115 | def save_model(self, model, result): 116 | pickle.dump(model, open(self.model_filename, 'wb')) 117 | pickle.dump(result, open(self.result_filename, 'wb')) 118 | 119 | 120 | class TestModelSaver: 121 | def __init__(self): 122 | pass 123 | 124 | def save_model(self, model, result): 125 | do_test('../data/cm_test.pkl', result['cm_test']) 126 | do_test('../data/cm_train.pkl', result['cm_train']) 127 | X_train_processed = model.process_inputs(result['train_passengers']) 128 | do_test('../data/X_train_processed.pkl', X_train_processed) 129 | X_test_processed = model.process_inputs(result['test_passengers']) 130 | do_test('../data/X_test_processed.pkl', X_test_processed) 131 | 132 | 133 | class PassengerLoader: 134 | def __init__(self, loader, rare_titles=None): 135 | self.loader = loader 136 | self.rare_titles = rare_titles 137 | 138 | def get_passengers(self): 139 | passengers = [] 140 | for data in self.loader.get_passengers().itertuples(): 141 | # parch = Parents/Children, sibsp = Siblings/Spouses 142 | family_size = int(data.parch + data.sibsp) 143 | # Allen, Miss. Elisabeth Walton 144 | title = data.name.split(',')[1].split('.')[0].strip() 145 | passenger = Passenger( 146 | pid=int(data.pid), 147 | pclass=int(data.pclass), 148 | sex=str(data.sex), 149 | age=float(data.age), 150 | family_size=family_size, 151 | fare=float(data.fare), 152 | embarked=str(data.embarked), 153 | is_alone=1 if family_size == 1 else 0, 154 | title='rare' if title in self.rare_titles else title, 155 | is_survived=int(data.is_survived), 156 | ) 157 | passengers.append(passenger) 158 | return passengers 159 | 160 | 161 | class TitanicModel: 162 | def __init__(self): 163 | self.trained = False 164 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 165 | self.knn_imputer = KNNImputer(n_neighbors=5) 166 | self.robust_scaler = RobustScaler() 167 | self.predictor = LogisticRegression(random_state=0) 168 | 169 | def process_inputs(self, passengers): 170 | data = pd.DataFrame([v.dict() for v in passengers]) 171 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 172 | numerical_data = data[['age', 'fare', 'family_size']] 173 | if self.trained: 174 | categorical_data = self.one_hot_encoder.transform(categorical_data) 175 | numerical_data = self.robust_scaler.transform( 176 | self.knn_imputer.transform(numerical_data) 177 | ) 178 | else: 179 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data) 180 | numerical_data = self.robust_scaler.fit_transform( 181 | self.knn_imputer.fit_transform(numerical_data) 182 | ) 183 | return np.hstack((categorical_data, numerical_data)) 184 | 185 | def train(self, passengers): 186 | targets = [v.is_survived for v in passengers] 187 | inputs = self.process_inputs(passengers) 188 | self.predictor.fit(inputs, targets) 189 | self.trained = True 190 | 191 | def estimate(self, passengers): 192 | inputs = self.process_inputs(passengers) 193 | return self.predictor.predict(inputs) 194 | 195 | 196 | class TitanicModelCreator: 197 | def __init__(self, loader, model_saver): 198 | self.loader = loader 199 | self.model_saver = model_saver 200 | np.random.seed(42) 201 | 202 | def split_passengers(self, passengers): 203 | passengers_map = {p.pid: p for p in passengers} 204 | pids = [passenger.pid for passenger in passengers] 205 | targets =
[passenger.is_survived for passenger in passengers] 206 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 207 | train_passengers = [passengers_map[pid] for pid in train_pids] 208 | test_passengers = [passengers_map[pid] for pid in test_pids] 209 | return train_passengers, test_passengers 210 | 211 | def run(self): 212 | passengers = self.loader.get_passengers() 213 | train_passengers, test_passengers = self.split_passengers(passengers) 214 | 215 | # --- TRAINING --- 216 | model = TitanicModel() 217 | model.train(train_passengers) 218 | y_train_estimation = model.estimate(train_passengers) 219 | cm_train = confusion_matrix( 220 | [v.is_survived for v in train_passengers], y_train_estimation 221 | ) 222 | 223 | # --- TESTING --- 224 | y_test_estimation = model.estimate(test_passengers) 225 | cm_test = confusion_matrix( 226 | [v.is_survived for v in test_passengers], y_test_estimation 227 | ) 228 | 229 | self.model_saver.save_model( 230 | model=model, 231 | result={ 232 | 'cm_train': cm_train, 233 | 'cm_test': cm_test, 234 | 'train_passengers': train_passengers, 235 | 'test_passengers': test_passengers, 236 | }, 237 | ) 238 | 239 | 240 | def main(param: str = 'pass'): 241 | titanic_model_creator = TitanicModelCreator( 242 | loader=PassengerLoader( 243 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 244 | rare_titles=RARE_TITLES, 245 | ), 246 | model_saver=ModelSaver( 247 | model_filename='../data/real_model.pkl', 248 | result_filename='../data/real_result.pkl', 249 | ), 250 | ) 251 | titanic_model_creator.run() 252 | 253 | 254 | def test_main(param: str = 'pass'): 255 | titanic_model_creator = TitanicModelCreator( 256 | loader=PassengerLoader( 257 | loader=TestLoader( 258 | passengers_filename='../data/passengers_with_is_survived.pkl', 259 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 260 | ), 261 | rare_titles=RARE_TITLES, 262 | ), 263 | model_saver=TestModelSaver(), 264 | ) 265 | titanic_model_creator.run() 266 | 267 | 268 | if __name__ == "__main__": 269 | typer.run(test_main) 270 | -------------------------------------------------------------------------------- /Step21/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /Step22/README21.md: -------------------------------------------------------------------------------- 1 | ### Step 21: Enable training of different models 2 | 3 | - Add `model` property to `TitanicModelCreator` and use it in `run()` instead of the local `TitanicModel` instance 4 | - Add `TitanicModel` instantiation to the creation of `TitanicModelCreator` in both `main` and `test_main` 5 | - Expose parts of `TitanicModel` (predictor, processing parameters) 6 | 7 | At this point the refactoring is pretty much finished. This last step enables the creation of different models. Use the existing implementations as templates to create new shell scripts and main functions (contexts) for each experiment, with new loaders to create new datasets. Write different test contexts to make sure the changes you make are as intended. As more experiments emerge, you will see patterns and opportunities to extract common behaviour from similar implementations while still maintaining validity through the tests. This allows you to restructure your code on the fly and discover the most convenient architecture for your system.
Most problems in these systems are unforeseeable; there is no way to figure out the best structure before you start implementing. This requires a workflow that enables radical changes even at later stages of the project. Clean Architecture, end-to-end testing and maintained code quality provide exactly this at very low effort. 8 | 9 | Next steps: 10 | 11 | - Use different data: 12 | - Update `SqlLoader` to retrieve different data 13 | - Update `Passenger` class to contain this new data 14 | - Update `PassengerLoader` class to process this new data into the classes 15 | - Update `process_inputs` to create features out of this new data 16 | - Use different features 17 | - Update `process_inputs` in `TitanicModel`, expose parameters as needed 18 | - Use different model: 19 | - Use different `predictor` in `TitanicModel` 20 | -------------------------------------------------------------------------------- /Step22/make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /Step22/requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | -------------------------------------------------------------------------------- /Step22/titanic_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import typer 4 | import numpy as np 5 | import pandas as pd 6 | from pydantic import BaseModel 7 | from sqlalchemy import create_engine 8 | 9 | from sklearn.model_selection import train_test_split 10 | from sklearn.linear_model import LogisticRegression 11 | from sklearn.preprocessing import RobustScaler 12 | from sklearn.preprocessing import OneHotEncoder 13 | from sklearn.impute import KNNImputer 14 | from sklearn.metrics import confusion_matrix 15 | 16 | 17 | RARE_TITLES = { 18 | 'Capt', 19 | 'Col', 20 | 'Don', 21 | 'Dona', 22 | 'Dr', 23 | 'Jonkheer', 24 | 'Lady', 25 | 'Major', 26 | 'Mlle', 27 | 'Mme', 28 | 'Ms', 29 | 'Rev', 30 | 'Sir', 31 | 'the Countess', 32 | } 33 | 34 | 35 | class Passenger(BaseModel): 36 | pid: int 37 | pclass: int 38 | sex: str 39 | age: float 40 | family_size: int 41 | fare: float 42 | embarked: str 43 | is_alone: int 44 | title: str 45 | is_survived: int 46 | 47 | 48 | def do_test(filename, data): 49 | if not os.path.isfile(filename): 50 | pickle.dump(data, open(filename, 'wb')) 51 | truth = pickle.load(open(filename, 'rb')) 52 | try: 53 | np.testing.assert_almost_equal(data, truth) 54 | print(f'{filename} test passed') 55 | except AssertionError as ex: 56 | print(f'{filename} test failed {ex}') 57 | 58 | 59 | def do_pandas_test(filename, data): 60 | if not os.path.isfile(filename): 61 | data.to_pickle(filename) 62 | truth = pd.read_pickle(filename) 63 | try: 64 | pd.testing.assert_frame_equal(data, truth) 65 | print(f'{filename} pandas test passed') 66 | except AssertionError as ex: 67 | print(f'{filename} pandas test failed {ex}') 68 | 69 | 70 | class SqlLoader: 71 | def __init__(self, connection_string): 72 | engine = create_engine(connection_string) 73 | self.connection = engine.connect() 74 | 75 | def get_passengers(self): 76 | query = """ 77 | SELECT 78 |
tbl_passengers.pid, 79 | tbl_passengers.pclass, 80 | tbl_passengers.sex, 81 | tbl_passengers.age, 82 | tbl_passengers.parch, 83 | tbl_passengers.sibsp, 84 | tbl_passengers.fare, 85 | tbl_passengers.embarked, 86 | tbl_passengers.name, 87 | tbl_targets.is_survived 88 | FROM 89 | tbl_passengers 90 | JOIN 91 | tbl_targets 92 | ON 93 | tbl_passengers.pid=tbl_targets.pid 94 | """ 95 | return pd.read_sql(query, con=self.connection) 96 | 97 | 98 | class TestLoader: 99 | def __init__(self, passengers_filename, real_loader): 100 | self.passengers_filename = passengers_filename 101 | self.real_loader = real_loader 102 | if not os.path.isfile(self.passengers_filename): 103 | df = self.real_loader.get_passengers() 104 | df.to_pickle(self.passengers_filename) 105 | 106 | def get_passengers(self): 107 | return pd.read_pickle(self.passengers_filename) 108 | 109 | 110 | class ModelSaver: 111 | def __init__(self, model_filename, result_filename): 112 | self.model_filename = model_filename 113 | self.result_filename = result_filename 114 | 115 | def save_model(self, model, result): 116 | pickle.dump(model, open(self.model_filename, 'wb')) 117 | pickle.dump(result, open(self.result_filename, 'wb')) 118 | 119 | 120 | class TestModelSaver: 121 | def __init__(self): 122 | pass 123 | 124 | def save_model(self, model, result): 125 | do_test('../data/cm_test.pkl', result['cm_test']) 126 | do_test('../data/cm_train.pkl', result['cm_train']) 127 | X_train_processed = model.process_inputs(result['train_passengers']) 128 | do_test('../data/X_train_processed.pkl', X_train_processed) 129 | X_test_processed = model.process_inputs(result['test_passengers']) 130 | do_test('../data/X_test_processed.pkl', X_test_processed) 131 | 132 | 133 | class PassengerLoader: 134 | def __init__(self, loader, rare_titles=None): 135 | self.loader = loader 136 | self.rare_titles = rare_titles 137 | 138 | def get_passengers(self): 139 | passengers = [] 140 | for data in self.loader.get_passengers().itertuples(): 141 | # parch = Parents/Children, sibsp = Siblings/Spouses 142 | family_size = int(data.parch + data.sibsp) 143 | # Allen, Miss.
Elisabeth Walton 144 | title = data.name.split(',')[1].split('.')[0].strip() 145 | passenger = Passenger( 146 | pid=int(data.pid), 147 | pclass=int(data.pclass), 148 | sex=str(data.sex), 149 | age=float(data.age), 150 | family_size=family_size, 151 | fare=float(data.fare), 152 | embarked=str(data.embarked), 153 | is_alone=1 if family_size == 1 else 0, 154 | title='rare' if title in self.rare_titles else title, 155 | is_survived=int(data.is_survived), 156 | ) 157 | passengers.append(passenger) 158 | return passengers 159 | 160 | 161 | class TitanicModel: 162 | def __init__(self, n_neighbors=5, predictor=None): 163 | if predictor is None: 164 | predictor = LogisticRegression(random_state=0) 165 | self.trained = False 166 | self.one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) 167 | self.knn_imputer = KNNImputer(n_neighbors=n_neighbors) 168 | self.robust_scaler = RobustScaler() 169 | self.predictor = predictor 170 | 171 | def process_inputs(self, passengers): 172 | data = pd.DataFrame([v.dict() for v in passengers]) 173 | categorical_data = data[['embarked', 'sex', 'pclass', 'title', 'is_alone']] 174 | numerical_data = data[['age', 'fare', 'family_size']] 175 | if self.trained: 176 | categorical_data = self.one_hot_encoder.transform(categorical_data) 177 | numerical_data = self.robust_scaler.transform( 178 | self.knn_imputer.transform(numerical_data) 179 | ) 180 | else: 181 | categorical_data = self.one_hot_encoder.fit_transform(categorical_data) 182 | numerical_data = self.robust_scaler.fit_transform( 183 | self.knn_imputer.fit_transform(numerical_data) 184 | ) 185 | return np.hstack((categorical_data, numerical_data)) 186 | 187 | def train(self, passengers): 188 | targets = [v.is_survived for v in passengers] 189 | inputs = self.process_inputs(passengers) 190 | self.predictor.fit(inputs, targets) 191 | self.trained = True 192 | 193 | def estimate(self, passengers): 194 | inputs = self.process_inputs(passengers) 195 | return self.predictor.predict(inputs) 196 | 197 | 198 | class TitanicModelCreator: 199 | def __init__(self, loader, model, model_saver): 200 | self.loader = loader 201 | self.model = model 202 | self.model_saver = model_saver 203 | np.random.seed(42) 204 | 205 | def split_passengers(self, passengers): 206 | passengers_map = {p.pid: p for p in passengers} 207 | pids = [passenger.pid for passenger in passengers] 208 | targets = [passenger.is_survived for passenger in passengers] 209 | train_pids, test_pids = train_test_split(pids, stratify=targets, test_size=0.2) 210 | train_passengers = [passengers_map[pid] for pid in train_pids] 211 | test_passengers = [passengers_map[pid] for pid in test_pids] 212 | return train_passengers, test_passengers 213 | 214 | def run(self): 215 | passengers = self.loader.get_passengers() 216 | train_passengers, test_passengers = self.split_passengers(passengers) 217 | 218 | # --- TRAINING --- 219 | self.model.train(train_passengers) 220 | y_train_estimation = self.model.estimate(train_passengers) 221 | cm_train = confusion_matrix( 222 | [v.is_survived for v in train_passengers], y_train_estimation 223 | ) 224 | 225 | # --- TESTING --- 226 | y_test_estimation = self.model.estimate(test_passengers) 227 | cm_test = confusion_matrix( 228 | [v.is_survived for v in test_passengers], y_test_estimation 229 | ) 230 | 231 | self.model_saver.save_model( 232 | model=self.model, 233 | result={ 234 | 'cm_train': cm_train, 235 | 'cm_test': cm_test, 236 | 'train_passengers': train_passengers, 237 | 'test_passengers': test_passengers, 238 | }, 
239 | ) 240 | 241 | 242 | def main(param: str = 'pass'): 243 | titanic_model_creator = TitanicModelCreator( 244 | loader=PassengerLoader( 245 | loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 246 | rare_titles=RARE_TITLES, 247 | ), 248 | model=TitanicModel(), 249 | model_saver=ModelSaver( 250 | model_filename='../data/real_model.pkl', 251 | result_filename='../data/real_result.pkl', 252 | ), 253 | ) 254 | titanic_model_creator.run() 255 | 256 | 257 | def test_main(param: str = 'pass'): 258 | titanic_model_creator = TitanicModelCreator( 259 | loader=PassengerLoader( 260 | loader=TestLoader( 261 | passengers_filename='../data/passengers_with_is_survived.pkl', 262 | real_loader=SqlLoader(connection_string='sqlite:///../data/titanic.db'), 263 | ), 264 | rare_titles=RARE_TITLES, 265 | ), 266 | model=TitanicModel(n_neighbors=5, predictor=LogisticRegression(random_state=0)), 267 | model_saver=TestModelSaver(), 268 | ) 269 | titanic_model_creator.run() 270 | 271 | 272 | if __name__ == "__main__": 273 | typer.run(test_main) 274 | -------------------------------------------------------------------------------- /Step22/titanic_model.sh: -------------------------------------------------------------------------------- 1 | python3.8 ./titanic_model.py $1 2 | -------------------------------------------------------------------------------- /create_branch.sh: -------------------------------------------------------------------------------- 1 | 2 | switch=$1 3 | 4 | if [ -z "$switch" ]; then 5 | echo "use ./create_branch -do" 6 | else 7 | cp Step02/* Step01 8 | cp Step03/* Step02 9 | cp Step04/* Step03 10 | cp Step05/* Step04 11 | cp Step06/* Step05 12 | cp Step07/* Step06 13 | cp Step08/* Step07 14 | cp Step09/* Step08 15 | cp Step10/* Step09 16 | cp Step11/* Step10 17 | cp Step12/* Step11 18 | cp Step13/* Step12 19 | cp Step14/* Step13 20 | cp Step15/* Step14 21 | cp Step16/* Step15 22 | cp Step17/* Step16 23 | cp Step18/* Step17 24 | cp Step19/* Step18 25 | cp Step20/* Step19 26 | cp Step21/* Step20 27 | cp Step22/* Step21 28 | fi 29 | -------------------------------------------------------------------------------- /create_instructions.sh: -------------------------------------------------------------------------------- 1 | find . 
-name "README??.md" | sort | xargs cat > INSTRUCTIONS.md 2 | -------------------------------------------------------------------------------- /create_sqlite.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import typer 4 | from sklearn.datasets import fetch_openml 5 | from sqlalchemy import create_engine 6 | 7 | 8 | def main(): 9 | print('loading data') 10 | df, targets = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True) 11 | targets = pd.DataFrame(np.array([int(v) for v in targets]), columns=['is_survived']) 12 | 13 | print('creating db') 14 | engine = create_engine('sqlite:///titanic.db', echo=True) 15 | sqlite_connection = engine.connect() 16 | 17 | print('saving passengers') 18 | df.to_sql('tbl_passengers', sqlite_connection, index_label='pid') 19 | print('saving targets') 20 | targets.to_sql('tbl_targets', sqlite_connection, index_label='pid') 21 | 22 | print('closing db') 23 | sqlite_connection.close() 24 | 25 | print('done') 26 | 27 | 28 | if __name__ == "__main__": 29 | typer.run(main) 30 | -------------------------------------------------------------------------------- /data/titanic.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xLaszlo/CQ4DS-notebook-sklearn-refactoring-exercise/16059caade85caeb6f4ccf7e492e1d67f87d28a5/data/titanic.db -------------------------------------------------------------------------------- /make_venv.sh: -------------------------------------------------------------------------------- 1 | python3.8 -m venv .venv 2 | source .venv/bin/activate 3 | pip3 install --upgrade pip 4 | pip3 install setuptools==57.1.0 5 | pip3 install wheel 6 | pip3 install -r requirements.txt 7 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | typer 2 | scikit-learn 3 | jupyter 4 | pandas 5 | numpy 6 | sqlalchemy 7 | pydantic 8 | black 9 | -------------------------------------------------------------------------------- /test_all.sh: -------------------------------------------------------------------------------- 1 | rm data/*.pkl 2 | echo "Step05" && cd Step05 && ./titanic_model.sh && cd .. 3 | echo "Step06" && cd Step06 && ./titanic_model.sh && cd .. 4 | echo "Step07" && cd Step07 && ./titanic_model.sh && cd .. 5 | echo "Step08" && cd Step08 && ./titanic_model.sh && cd .. 6 | echo "Step09" && cd Step09 && ./titanic_model.sh && cd .. 7 | echo "Step10" && cd Step10 && ./titanic_model.sh && cd .. 8 | echo "Step11" && cd Step11 && ./titanic_model.sh && cd .. 9 | echo "Step12" && cd Step12 && ./titanic_model.sh && cd .. 10 | echo "Step13" && cd Step13 && ./titanic_model.sh && cd .. 11 | echo "Step14" && cd Step14 && ./titanic_model.sh && cd .. 12 | echo "Step15" && cd Step15 && ./titanic_model.sh && cd .. 13 | echo "Step16" && cd Step16 && ./titanic_model.sh && cd .. 14 | echo "Step17" && cd Step17 && ./titanic_model.sh && cd .. 15 | echo "Step18" && cd Step18 && ./titanic_model.sh && cd .. 16 | echo "Step19" && cd Step19 && ./titanic_model.sh && cd .. 17 | echo "Step20" && cd Step20 && ./titanic_model.sh && cd .. 18 | echo "Step21" && cd Step21 && ./titanic_model.sh && cd .. 19 | echo "Step22" && cd Step22 && ./titanic_model.sh && cd .. 20 | --------------------------------------------------------------------------------