├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ ├── feature_request.md │ └── feedback-or-question.md ├── .gitignore ├── 01-introduction ├── README.md ├── comparison.ipynb ├── homework.ipynb └── housing.csv ├── 02-regression ├── README.md ├── homework.ipynb └── housing.csv ├── 03-classification ├── README.md ├── data.csv └── homework.ipynb ├── 04-evaluation ├── README.md └── homework.ipynb ├── 05-deployment ├── Dockerfile ├── Pipfile ├── Pipfile.lock ├── README.md ├── app.py ├── dv.bin ├── homework.ipynb └── model1.bin ├── 06-trees ├── README.md ├── homework.ipynb ├── housing.csv └── ozkary-decision-tree.png ├── 07-midterm-project └── README.md ├── 08-deep-learning ├── README.md ├── homework.ipynb └── images │ ├── cnn_metrics.png │ ├── cnn_metrics_augmented.png │ ├── ozkary-convolutional-neural-network-bk.png │ └── ozkary-convolutional-neural-network.png ├── 09-serverless ├── .gitignore ├── Dockerfile ├── Pipfile ├── Pipfile.lock ├── README.md ├── homework.ipynb ├── img_ai │ ├── __init__.py │ ├── __version__.py │ └── bees_wasps.py └── main.py ├── 10-kubernetes ├── README.md ├── homework.ipynb └── src │ ├── Dockerfile │ ├── Pipfile │ ├── Pipfile.lock │ ├── deployment.yaml │ ├── dv.bin │ ├── hpa.yaml │ ├── model1.bin │ ├── q3_test.py │ ├── q4_predict.py │ ├── q4_test.py │ ├── q6_predict.py │ ├── q6_test.py │ ├── q6_test_loop.py │ └── service.yaml ├── 11-kserve ├── README.md ├── homework.ipynb ├── iris_example.yaml └── quick_install.sh ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── data └── data.csv ├── getting_started.md ├── images ├── machine-learning-engineering.jpg └── ozkary-machine-learning-classification.png ├── projects ├── drug-drug-interaction │ ├── .gitignore │ ├── Dockerfile │ ├── Pipfile │ ├── Pipfile.lock │ ├── README.md │ ├── app.py │ ├── data │ │ ├── Approved_drug_Information.txt │ │ ├── drugbank_known_ddi.txt │ │ ├── drugbank_slim_df.csv │ │ ├── interaction_types.csv │ │ ├── pharma.7z │ │ └── test_cases.csv │ ├── data_analysis.ipynb │ ├── data_control.ipynb │ ├── data_predict.ipynb │ ├── data_test_api.ipynb │ ├── data_train.ipynb │ ├── data_train_mlp.ipynb │ ├── ddi_lib │ │ ├── __init__.py │ │ ├── __version__.py │ │ ├── data_predict.py │ │ ├── data_train.py │ │ └── data_train_mlp.py │ ├── ddi_lib_rel │ │ ├── __init__.py │ │ ├── __version__.py │ │ └── data_predict.py │ ├── images │ │ ├── ozkary-interaction-type-class-balance.png │ │ ├── ozkary-interaction-type-distribution.png │ │ ├── ozkary-ml-ddi-model-confusion-matrix.png │ │ ├── ozkary-ml-ddi-model-evaluation.png │ │ ├── ozkary-mlp-model3_accuracy.png │ │ ├── ozkary-mlp-model3_loss.png │ │ ├── ozkary-mlp-neural-network1.png │ │ ├── ozkary-mlp-neural-network2.png │ │ ├── ozkary-mlp-neural-network3.png │ │ ├── ozkary-pca-feature-importance.png │ │ └── ozkary-predicting-drug-drug-interactions-with-ai.jpg │ ├── models │ │ ├── ozkary-ddi.h5 │ │ ├── ozkary_ddi_encoder.pkl.bin │ │ └── ozkary_ddi_xgboost.pkl.bin │ ├── ozkary-ai-ddi │ │ ├── .gitignore │ │ ├── .vscode │ │ │ └── extensions.json │ │ ├── Dockerfile │ │ ├── azure-deploy.sh │ │ ├── getting_started.md │ │ ├── host.json │ │ ├── predict │ │ │ ├── Pipfile │ │ │ ├── Pipfile.lock │ │ │ ├── __init__.py │ │ │ ├── __version__.py │ │ │ ├── data │ │ │ │ └── interaction_types.csv │ │ │ ├── ddi_lib │ │ │ │ ├── __init__.py │ │ │ │ ├── __version__.py │ │ │ │ ├── data_predict.py │ │ │ │ ├── data_train.py │ │ │ │ └── data_train_mlp.py │ │ │ ├── function.json │ │ │ └── models │ │ │ │ ├── ozkary-ddi.h5 │ │ │ │ ├── ozkary_ddi_encoder.pkl.bin │ │ │ │ └── 
ozkary_ddi_xgboost.pkl.bin │ │ └── requirements.txt │ ├── requirements.txt │ └── serverless_deploy.sh ├── heart-disease-risk │ ├── .gitignore │ ├── Dockerfile │ ├── Pipfile │ ├── Pipfile.lock │ ├── README.md │ ├── app.py │ ├── bin │ │ ├── hd_dictvectorizer.pkl.bin │ │ └── hd_xgboost_model.pkl.bin │ ├── data │ │ └── test_cases.csv │ ├── data_analysis.ipynb │ ├── data_predict.ipynb │ ├── data_predict.py │ ├── data_processing.ipynb │ ├── data_test_api.ipynb │ ├── data_train.ipynb │ ├── data_train.py │ ├── fn-ai-ml-heart-disease │ │ ├── .gitignore │ │ ├── .vscode │ │ │ └── extensions.json │ │ ├── Pipfile │ │ ├── Pipfile.lock │ │ ├── azure-deploy.sh │ │ ├── getting_started.md │ │ ├── host.json │ │ ├── main.tf │ │ ├── predict │ │ │ ├── __init__.py │ │ │ ├── data_predict.py │ │ │ ├── function.json │ │ │ ├── hd_dictvectorizer.pkl │ │ │ └── hd_xgboost_model.pkl │ │ └── requirements.txt │ ├── heart_disease_model_base.py │ ├── heart_disease_model_factory.py │ ├── heart_disease_random_forest.py │ └── images │ │ ├── ozkary-ml-heart-disease-azure-function.png │ │ ├── ozkary-ml-heart-disease-class-balance.png │ │ ├── ozkary-ml-heart-disease-feature-analysis.png │ │ ├── ozkary-ml-heart-disease-feature-importance.png │ │ ├── ozkary-ml-heart-disease-model-confusion-matrix.png │ │ └── ozkary-ml-heart-disease-model-evaluation.png └── vehicle-msrp │ └── regression.ipynb └── voice-cloning └── README.md /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 
21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feedback-or-question.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feedback or Question 3 | about: Use this to send a feedback or question 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | ## Describe your question or feedback 11 | 12 | ### Add code snippet below 13 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | *.zip 29 | *.csv 30 | *.parquet 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .nox/ 45 | .coverage 46 | .coverage.* 47 | .cache 48 | nosetests.xml 49 | coverage.xml 50 | *.cover 51 | *.py,cover 52 | .hypothesis/ 53 | .pytest_cache/ 54 | cover/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | db.sqlite3 64 | db.sqlite3-journal 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | .pybuilder/ 78 | target/ 79 | 80 | # Jupyter Notebook 81 | .ipynb_checkpoints 82 | 83 | # IPython 84 | profile_default/ 85 | ipython_config.py 86 | 87 | # pyenv 88 | # For a library or package, you might want to ignore these files since the code is 89 | # intended to run in multiple environments; otherwise, check them in: 90 | # .python-version 91 | 92 | # pipenv 93 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 94 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 95 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 96 | # install all needed dependencies. 97 | #Pipfile.lock 98 | 99 | # poetry 100 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 101 | # This is especially recommended for binary packages to ensure reproducibility, and is more 102 | # commonly ignored for libraries. 103 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 104 | #poetry.lock 105 | 106 | # pdm 107 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 108 | #pdm.lock 109 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 110 | # in version control. 111 | # https://pdm.fming.dev/#use-with-ide 112 | .pdm.toml 113 | 114 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 115 | __pypackages__/ 116 | 117 | # Celery stuff 118 | celerybeat-schedule 119 | celerybeat.pid 120 | 121 | # SageMath parsed files 122 | *.sage.py 123 | 124 | # Environments 125 | .env 126 | .venv 127 | env/ 128 | venv/ 129 | ENV/ 130 | env.bak/ 131 | venv.bak/ 132 | 133 | # Spyder project settings 134 | .spyderproject 135 | .spyproject 136 | 137 | # Rope project settings 138 | .ropeproject 139 | 140 | # mkdocs documentation 141 | /site 142 | 143 | # mypy 144 | .mypy_cache/ 145 | .dmypy.json 146 | dmypy.json 147 | 148 | # Pyre type checker 149 | .pyre/ 150 | 151 | # pytype static type analyzer 152 | .pytype/ 153 | 154 | # Cython debug symbols 155 | cython_debug/ 156 | 157 | # PyCharm 158 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 159 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 160 | # and can be added to the global gitignore or merged into this file. For a more nuclear 161 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 162 | #.idea/ 163 | 164 | # ignore the images *.jpg in the 08-deep-learning folder and subfolders 165 | 08-deep-learning/**/*.jpg 166 | *.h5git 167 | *.tflite 168 | 169 | projects/census 170 | projects/drug-drug-interaction/data/ssp_interaction_type.csv.all.gz 171 | 172 | *.bin -------------------------------------------------------------------------------- /01-introduction/comparison.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | " # ML impacts how computers solve problems. Traditional systems rely on pre-defined rules programmed by humans. This approach struggles with complexity and doesn't adapt to new information. In contrast, ML enables computers to learn directly from data, similar to how humans learn." 
8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Traditional Code\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "def heart_disease_risk_rule_based(age, overweight, diabetic):\n", 24 | " \"\"\"\n", 25 | " Assesses heart disease risk based on a set of predefined rules.\n", 26 | "\n", 27 | " Args:\n", 28 | " age: Age of the individual (int).\n", 29 | " overweight: True if overweight, False otherwise (bool).\n", 30 | " diabetic: True if diabetic, False otherwise (bool).\n", 31 | "\n", 32 | " Returns:\n", 33 | " \"High Risk\" or \"Low Risk\" (str).\n", 34 | " \"\"\"\n", 35 | " if age > 50 and overweight and diabetic:\n", 36 | " return \"High Risk\"\n", 37 | " elif age > 60 and (overweight or diabetic):\n", 38 | " return \"High Risk\"\n", 39 | " elif age > 40 and overweight and not diabetic:\n", 40 | " return \"Moderate Risk\"\n", 41 | " else:\n", 42 | " return \"Low Risk\"" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "name": "stdout", 52 | "output_type": "stream", 53 | "text": [ 54 | "High Risk\n", 55 | "Low Risk\n", 56 | "High Risk\n", 57 | "Moderate Risk\n" 58 | ] 59 | } 60 | ], 61 | "source": [ 62 | " # Examples\n", 63 | " print(heart_disease_risk_rule_based(55, True, True)) # Output: High Risk\n", 64 | " print(heart_disease_risk_rule_based(45, False, False)) # Output: Low Risk\n", 65 | " print(heart_disease_risk_rule_based(65, False, True)) # Output: High Risk\n", 66 | " print(heart_disease_risk_rule_based(45, True, False)) # Output: Moderate Risk" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "# Machine Learning (Data-Driven Approach)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 7, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | " Age Overweight Diabetic Heart Disease\n", 86 | "0 69 False True Yes\n", 87 | "1 43 True True No\n", 88 | "2 72 False False No\n", 89 | "3 58 True False Yes\n", 90 | "4 70 False False Yes\n", 91 | ".. ... ... ... 
...\n", 92 | "95 43 False False No\n", 93 | "96 31 True True No\n", 94 | "97 45 False True No\n", 95 | "98 48 False False No\n", 96 | "99 70 True False Yes\n", 97 | "\n", 98 | "[100 rows x 4 columns]\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "# get the data\n", 104 | "import random\n", 105 | "import pandas as pd\n", 106 | "\n", 107 | "def generate_heart_disease_data(num_records=50):\n", 108 | " \"\"\"\n", 109 | " Generates synthetic data for heart disease risk assessment.\n", 110 | "\n", 111 | " Args:\n", 112 | " num_records: The number of data records to generate.\n", 113 | "\n", 114 | " Returns:\n", 115 | " A pandas DataFrame containing the generated data.\n", 116 | " \"\"\"\n", 117 | "\n", 118 | " data = {\n", 119 | " 'Age': [],\n", 120 | " 'Overweight': [],\n", 121 | " 'Diabetic': [],\n", 122 | " 'Heart Disease': []\n", 123 | " }\n", 124 | "\n", 125 | " for _ in range(num_records):\n", 126 | " age = random.randint(30, 80) # Assuming age range of 30-80\n", 127 | " overweight = random.choice([True, False])\n", 128 | " diabetic = random.choice([True, False])\n", 129 | "\n", 130 | " # Introduce some logic for heart disease risk based on factors\n", 131 | " if age > 60 and (overweight or diabetic):\n", 132 | " heart_disease = random.choices(['Yes', 'No'], weights=[0.8, 0.2])[0] # Higher chance of Yes\n", 133 | " elif age > 50 and overweight and diabetic:\n", 134 | " heart_disease = random.choices(['Yes', 'No'], weights=[0.7, 0.3])[0]\n", 135 | " elif age > 40 and overweight and not diabetic:\n", 136 | " heart_disease = random.choices(['Yes', 'No'], weights=[0.3, 0.7])[0] # Lower chance of Yes\n", 137 | " else:\n", 138 | " heart_disease = random.choices(['Yes', 'No'], weights=[0.1, 0.9])[0] # Low chance of Yes\n", 139 | "\n", 140 | " data['Age'].append(age)\n", 141 | " data['Overweight'].append(overweight)\n", 142 | " data['Diabetic'].append(diabetic)\n", 143 | " data['Heart Disease'].append(heart_disease)\n", 144 | "\n", 145 | " return pd.DataFrame(data)\n", 146 | "\n", 147 | "\n", 148 | "# Create a sample dataframe\n", 149 | "data = generate_heart_disease_data(100)\n", 150 | "print(data.head(100))" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "name": "stdout", 160 | "output_type": "stream", 161 | "text": [ 162 | "Accuracy of the model: 0.75\n" 163 | ] 164 | } 165 | ], 166 | "source": [ 167 | "\n", 168 | "from sklearn.model_selection import train_test_split\n", 169 | "from sklearn.ensemble import RandomForestClassifier\n", 170 | "from sklearn.metrics import accuracy_score\n", 171 | "\n", 172 | "\n", 173 | "df = pd.DataFrame(data)\n", 174 | "# Prepare the data\n", 175 | "X = df[['Age', 'Overweight', 'Diabetic']] # Features\n", 176 | "y = df['Heart Disease'] # Target\n", 177 | "\n", 178 | "# Split data into training and testing sets\n", 179 | "# X has the categories/features\n", 180 | "# y has the target value\n", 181 | "# train data is for training\n", 182 | "# test data is for testing\n", 183 | "# .2 means 20% of the data is used for testing 80% for training\n", 184 | "# 42 is the seed for random shuffling\n", 185 | "\n", 186 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", 187 | "\n", 188 | "# Train a Random Forest classifier\n", 189 | "model = RandomForestClassifier()\n", 190 | "model.fit(X_train, y_train)\n", 191 | "\n", 192 | "# Make predictions on the test set\n", 193 | "y_pred = model.predict(X_test)\n", 194 | "\n", 195 | "# Evaluate the 
model\n", 196 | "accuracy = accuracy_score(y_test, y_pred)\n", 197 | "print(f\"Accuracy of the model : {accuracy}\")\n", 198 | "\n", 199 | "# 70% - 80%: Often considered a reasonable starting point for many classification problems.\n", 200 | "# 80% - 90%: Good performance for many applications.\n", 201 | "# 90% - 95%: Very good performance. Often challenging to achieve, but possible for well-behaved problems with good data.\n", 202 | "# > 95%: Excellent performance, potentially approaching the limits of what's possible for the problem. Be careful of overfitting if you're achieving very high accuracy.\n", 203 | "# 100%: Usually a sign of overfitting." 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 11, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "Prediction on the new data ['Yes']\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "# New Data prediction\n", 221 | "new_data = pd.DataFrame({\n", 222 | " 'Age': [55],\n", 223 | " 'Overweight': [True],\n", 224 | " 'Diabetic': [True]\n", 225 | "})\n", 226 | "\n", 227 | "prediction = model.predict(new_data)\n", 228 | "print(f\"Prediction on the new data {prediction}\")" 229 | ] 230 | } 231 | ], 232 | "metadata": { 233 | "kernelspec": { 234 | "display_name": "Python 3", 235 | "language": "python", 236 | "name": "python3" 237 | }, 238 | "language_info": { 239 | "codemirror_mode": { 240 | "name": "ipython", 241 | "version": 3 242 | }, 243 | "file_extension": ".py", 244 | "mimetype": "text/x-python", 245 | "name": "python", 246 | "nbconvert_exporter": "python", 247 | "pygments_lexer": "ipython3", 248 | "version": "3.8.10" 249 | } 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 2 253 | } 254 | -------------------------------------------------------------------------------- /01-introduction/homework.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction to Machine Learning\n", 8 | "\n", 9 | "## Set up the development environment\n", 10 | "\n", 11 | "Install the following packages using pip:\n", 12 | "\n", 13 | "```bash\n", 14 | "pip install numpy pandas matplotlib seaborn scikit-learn\n", 15 | "```\n", 16 | "## Homework 1" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "### Question 1: What's the version of Pandas that you installed?\n", 24 | "\n", 25 | "You can get the version information using the __version__ field:" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/plain": [ 36 | "'1.5.2'" 37 | ] 38 | }, 39 | "execution_count": 2, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "import pandas as pd\n", 46 | "pd.__version__" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | " - Download the sample data\n", 54 | "```bash\n", 55 | "wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv\n", 56 | "```" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### Question 2: How many columns are in the dataset?" 
64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 3, 69 | "metadata": {}, 70 | "outputs": [ 71 | { 72 | "name": "stdout", 73 | "output_type": "stream", 74 | "text": [ 75 | "Number of columns: 10\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "df = pd.read_csv('housing.csv', iterator=False)\n", 81 | "\n", 82 | "# Get the number of columns\n", 83 | "num_columns = df.shape[1]\n", 84 | "print('Number of columns:', num_columns)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### Question 3: Which columns in the dataset have missing values?" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 7, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "Missing data in each column:\n", 104 | "longitude 0\n", 105 | "latitude 0\n", 106 | "housing_median_age 0\n", 107 | "total_rooms 0\n", 108 | "total_bedrooms 207\n", 109 | "population 0\n", 110 | "households 0\n", 111 | "median_income 0\n", 112 | "median_house_value 0\n", 113 | "ocean_proximity 0\n", 114 | "dtype: int64\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "# Check for missing data in each column\n", 120 | "missing_data = df.isnull().sum()\n", 121 | "\n", 122 | "print('Missing data in each column:')\n", 123 | "print(missing_data)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "### Question 4: How many unique values does the ocean_proximity column have?" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "name": "stdout", 140 | "output_type": "stream", 141 | "text": [ 142 | "Unique values of the column \"ocean_proximity\":\n", 143 | "['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']\n", 144 | "Number of unique values of the column \"ocean_proximity\": 5\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "# get the unique values of the column 'ocean_proximity'\n", 150 | "unique_values = df['ocean_proximity'].unique()\n", 151 | "print('Unique values of the column \"ocean_proximity\":')\n", 152 | "print(unique_values)\n", 153 | "\n", 154 | "# get the number of unique values of the column 'ocean_proximity'\n", 155 | "print('Number of unique values of the column \"ocean_proximity\":', len(unique_values))\n" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "### Question 5: What's the average value of the median_house_value for the houses located near the bay?" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 11, 168 | "metadata": {}, 169 | "outputs": [ 170 | { 171 | "name": "stdout", 172 | "output_type": "stream", 173 | "text": [ 174 | "Average value of the column median_house_value for house located near the bay:\n", 175 | "259212.31\n" 176 | ] 177 | } 178 | ], 179 | "source": [ 180 | "# get the average value of the column median_house_value for houses located near the bay\n", 181 | "bay_area = df[df['ocean_proximity'] == 'NEAR BAY']\n", 182 | "average_value = bay_area['median_house_value'].mean()\n", 183 | "print('Average value of the column median_house_value for house located near the bay:')\n", 184 | "# format the output to 2 decimal places\n", 185 | "print(f'{average_value:.2f}')\n" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### Question 6: Has the mean value changed after filling missing values?" 
193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 17, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "Average of total_bedrooms column in the dataset: 537.871\n", 205 | "Average of total_bedrooms column in the dataset after filling missing values: 537.871\n", 206 | "Average has changed after filling the missing values: False\n", 207 | "Average has changed after filling the missing values: 0.000\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "# Calculate the average of total_bedrooms column in the dataset\n", 213 | "average_bedrooms = df['total_bedrooms'].mean()\n", 214 | "print(f'Average of total_bedrooms column in the dataset: {average_bedrooms:.3f}')\n", 215 | "\n", 216 | "# Use the fillna method to fill the missing values in total_bedrooms with the mean value from the previous step\n", 217 | "df['total_bedrooms'].fillna(average_bedrooms, inplace=True)\n", 218 | "\n", 219 | "# Calculate the average of total_bedrooms column in the dataset again\n", 220 | "updated_average_bedrooms = df['total_bedrooms'].mean()\n", 221 | "print(f'Average of total_bedrooms column in the dataset after filling missing values: {updated_average_bedrooms:.3f}')\n", 222 | "\n", 223 | "# has the average changed after filling the missing values?\n", 224 | "print(f'Average has changed after filling the missing values: {average_bedrooms != updated_average_bedrooms}')\n", 225 | "\n", 226 | "# round the two averages to 3 decimal places and compare if the value has changed\n", 227 | "print(f'Average has changed after filling the missing values: {average_bedrooms != updated_average_bedrooms:.3f}')\n" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### Question 7: Value of the last element of w" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 26, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "name": "stdout", 244 | "output_type": "stream", 245 | "text": [ 246 | "w: [23.12330961 -1.48124183 5.69922946]\n", 247 | "Value of the last element of w: 5.6992\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "# Select all the options located on islands\n", 253 | "island_options = df[df['ocean_proximity'] == 'ISLAND']\n", 254 | "\n", 255 | "# Select only columns housing_median_age, total_rooms, total_bedrooms\n", 256 | "island_options = island_options[['housing_median_age', 'total_rooms', 'total_bedrooms']]\n", 257 | "\n", 258 | "# Get the underlying NumPy array. Let's call it X\n", 259 | "X = island_options.values\n", 260 | "\n", 261 | "# Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX\n", 262 | "XTX = X.T.dot(X)\n", 263 | "\n", 264 | "# Compute the inverse of XTX. Let's call the result inv_XTX\n", 265 | "from numpy.linalg import inv\n", 266 | "inv_XTX = inv(XTX)\n", 267 | "\n", 268 | "# Create an array y with values [950, 1300, 800, 1000, 1300]\n", 269 | "y = [950, 1300, 800, 1000, 1300]\n", 270 | "\n", 271 | "# Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. 
Call the result w\n", 272 | "w = inv_XTX.dot(X.T).dot(y)\n", 273 | "print('w:', w)\n", 274 | "\n", 275 | "# What's the value of the last element of w?\n", 276 | "print(f'Value of the last element of w: {w[-1]:.4f}')\n", 277 | "\n" 278 | ] 279 | } 280 | ], 281 | "metadata": { 282 | "kernelspec": { 283 | "display_name": "Python 3 (ipykernel)", 284 | "language": "python", 285 | "name": "python3" 286 | }, 287 | "language_info": { 288 | "codemirror_mode": { 289 | "name": "ipython", 290 | "version": 3 291 | }, 292 | "file_extension": ".py", 293 | "mimetype": "text/x-python", 294 | "name": "python", 295 | "nbconvert_exporter": "python", 296 | "pygments_lexer": "ipython3", 297 | "version": "3.8.10" 298 | }, 299 | "orig_nbformat": 4 300 | }, 301 | "nbformat": 4, 302 | "nbformat_minor": 2 303 | } 304 | -------------------------------------------------------------------------------- /02-regression/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning (ML) Regression 2 | 3 | Regression is a fundamental technique in machine learning used for predicting continuous outcomes. It's widely used in various domains, such as finance, healthcare, and economics, to forecast trends, analyze relationships, and make predictions based on input variables. 4 | 5 | In regression, the goal is to find the best-fitting line or curve that describes the relationship between input features (independent variables) and the target variable (dependent variable). Linear regression is one of the simplest and widely used regression techniques, aiming to fit a linear equation to the data. 6 | 7 | ## Regression Using Linear Regression: A Step-by-Step Process 8 | 9 | ### 1. Prepare the Data and Perform Exploratory Data Analysis (EDA): 10 | - Clean the data, handle missing values, and preprocess features. 11 | - Explore and understand the data through statistical analysis, visualization, and summary statistics. 12 | 13 | ### 2. Use Linear Regression to Predict the Target (Price in this example): 14 | - Select the features (independent variables) and the target variable (e.g., house price). 15 | - Split the data into training and testing sets. 16 | - Train a linear regression model on the training data. 17 | 18 | ### 3. Internal Workings of Linear Regression: 19 | - Linear regression fits a line (in simple linear regression) or a hyperplane (in multiple linear regression) to minimize the sum of squared differences between the observed and predicted values. 20 | - It uses techniques like Ordinary Least Squares (OLS) to estimate the model parameters (coefficients) that define the line. 21 | 22 | ### 4. Evaluate the Model using Root Mean Squared Error (RMSE): 23 | - RMSE measures the average error between the observed and predicted values. 24 | - Lower RMSE indicates a better fit of the model to the data. 25 | 26 | ### 5. Feature Engineering: 27 | - Enhance the model's predictive power by creating new features or transforming existing ones. 28 | - Feature engineering may involve scaling, binning, one-hot encoding, or extracting useful information from raw data. 29 | 30 | ### 6. Regularization (Optional): 31 | - Implement regularization techniques like Lasso (L1) or Ridge (L2) regression to prevent overfitting and improve model generalization. 32 | - Regularization adds a penalty to the model parameters to avoid excessively large coefficients. 33 | 34 | ### 7. Use the Model for Predictions: 35 | - Apply the trained model to make predictions on new, unseen data. 
36 | - Use the model to forecast prices based on the chosen features. 37 | 38 | By following this step-by-step process, you can effectively use linear regression for prediction, understand its internal mechanisms, evaluate its performance, and enhance it through feature engineering and regularization. 39 | 40 | ## Example: Predicting House Prices 41 | 42 | Here's a Python example for each step of the process using a simple linear regression to predict house prices: 43 | 44 | ### Prepare the Data and Perform Exploratory Data Analysis (EDA): 45 | 46 | ```python 47 | import pandas as pd 48 | import seaborn as sns 49 | import matplotlib.pyplot as plt 50 | 51 | # Load and explore the dataset 52 | df = pd.read_csv('housing.csv') 53 | 54 | # Display the first few rows of the dataset 55 | print(df.head()) 56 | 57 | # Visualize data 58 | sns.pairplot(df, x_vars=['area', 'bedrooms'], y_vars='price', height=5, aspect=1) 59 | plt.show() 60 | ``` 61 | 62 | ### Use Linear Regression to Predict the Target (Price in this example): 63 | 64 | ```python 65 | from sklearn.model_selection import train_test_split 66 | from sklearn.linear_model import LinearRegression 67 | from sklearn.metrics import mean_squared_error 68 | 69 | # Prepare the data 70 | X = df[['area', 'bedrooms']] 71 | y = df['price'] 72 | 73 | # Split the data into training and testing sets 74 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 75 | 76 | # Train the linear regression model 77 | model = LinearRegression() 78 | model.fit(X_train, y_train) 79 | ``` 80 | 81 | ### Evaluate the Model using Root Mean Squared Error (RMSE): 82 | 83 | ```python 84 | # Predict using the model 85 | y_pred = model.predict(X_test) 86 | 87 | # Evaluate the model 88 | rmse = mean_squared_error(y_test, y_pred, squared=False) 89 | print('Root Mean Squared Error:', rmse) 90 | ``` 91 | 92 | ### Feature Engineering: 93 | 94 | In this simple example, we're using the existing features 'area' and 'bedrooms'. 95 | 96 | ### Regularization (Optional): 97 | 98 | Regularization helps prevent overfitting by adding a penalty term to the model parameters (coefficients) during the training process. This penalty discourages overly complex models with large coefficients. In linear regression, two common types of regularization are Lasso (L1 regularization) and Ridge (L2 regularization). 99 | 100 | Here's how you could apply Ridge regularization to the linear regression model: 101 | 102 | ```python 103 | from sklearn.linear_model import Ridge 104 | 105 | # Train the Ridge regression model with regularization (alpha is the regularization parameter) 106 | ridge_model = Ridge(alpha=1.0) # You can adjust the alpha value 107 | ridge_model.fit(X_train, y_train) 108 | 109 | # Evaluate the Ridge model 110 | ridge_y_pred = ridge_model.predict(X_test) 111 | ridge_rmse = mean_squared_error(y_test, ridge_y_pred, squared=False) 112 | print('Ridge Regression RMSE:', ridge_rmse) 113 | ``` 114 | 115 | In a real-world scenario, when dealing with more complex data or when you observe overfitting in your model, you would typically experiment with both Lasso and Ridge regularization techniques to find an optimal value for the regularization parameter (alpha) that balances model complexity and performance. Regularization is a crucial tool in your toolkit for robust and stable model training in machine learning. 
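As a counterpart to the Ridge example above, here is a minimal Lasso (L1) sketch that reuses the same `X_train`/`y_train`/`X_test`/`y_test` split from the earlier code; the `alpha` value is only an illustrative starting point to tune:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Train the Lasso regression model (the L1 penalty can shrink some coefficients to exactly zero)
lasso_model = Lasso(alpha=0.1)  # adjust alpha to balance fit and sparsity
lasso_model.fit(X_train, y_train)

# Evaluate the Lasso model with RMSE, as with the Ridge model above
lasso_y_pred = lasso_model.predict(X_test)
lasso_rmse = mean_squared_error(y_test, lasso_y_pred, squared=False)
print('Lasso Regression RMSE:', lasso_rmse)

# Coefficients driven to zero indicate features Lasso effectively dropped
print('Lasso coefficients:', lasso_model.coef_)
```

Because the L1 penalty can zero out coefficients entirely, Lasso doubles as a rough feature-selection step: any feature with a zero coefficient contributes nothing to the prediction.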
116 | 117 | ### Use the Model for Predictions: 118 | 119 | ```python 120 | # Predict house prices for new data 121 | new_data = pd.DataFrame({'area': [1500, 2000], 'bedrooms': [3, 4]}) 122 | predicted_prices = model.predict(new_data) 123 | print('Predicted Prices:', predicted_prices) 124 | ``` 125 | 126 | Use the trained linear regression model to predict house prices for new, unseen data. Let's break down the code step by step: 127 | 128 | **Creating New Data:** 129 | ```python 130 | new_data = pd.DataFrame({'area': [1500, 2000], 'bedrooms': [3, 4]}) 131 | ``` 132 | - Here, a new DataFrame named `new_data` is created using pandas. It has two rows, each representing a new house's features (area and bedrooms). For demonstration purposes, we have two hypothetical houses with different areas and bedrooms. 133 | 134 | **Predicting House Prices:** 135 | ```python 136 | predicted_prices = model.predict(new_data) 137 | ``` 138 | - The `predict()` method of the trained linear regression model (`model`) is used to predict the house prices for the new data (`new_data`). The `predict()` method takes the features of the new data and returns the predicted prices based on the trained model. 139 | 140 | **Printing Predicted Prices:** 141 | ```python 142 | print('Predicted Prices:', predicted_prices) 143 | ``` 144 | - Finally, the predicted house prices are printed to the console. The `predicted_prices` variable holds the predicted prices for the new houses based on the features provided in `new_data`. 145 | 146 | This example demonstrates a basic implementation of linear regression for predicting house prices using two features: 'area' and 'bedrooms'. -------------------------------------------------------------------------------- /03-classification/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning (ML) Classification 2 | 3 | Classification is a fundamental task in supervised machine learning where the goal is to predict the categorical class or label of a given data point based on its features. In other words, it involves assigning a predefined category to each input instance based on its characteristics. 4 | 5 | ## Key Components: 6 | 7 | 1. **Features (Predictors):** 8 | - These are the measurable characteristics or attributes of the data, often represented as variables. In a classification task for vehicles, features could include make, model, year, engine horsepower, etc. 9 | 10 | 2. **Target Variable (Class or Label):** 11 | - This is the variable we aim to predict or classify. It consists of discrete categories or labels, such as vehicle types (e.g., sedan, SUV, truck). 12 | 13 | 3. **Model:** 14 | - The machine learning algorithm or model learns patterns and relationships from the features to predict the target variable's class. Common models for classification include Logistic Regression, Decision Trees, Support Vector Machines, etc. 15 | 16 | 4. **Training:** 17 | - During the training phase, the model learns the patterns in the data by adjusting its parameters based on the features and known target labels. 18 | 19 | 5. **Testing and Evaluation:** 20 | - The model's performance is evaluated on a separate dataset (test set) to assess its ability to generalize to new, unseen data. Evaluation metrics like accuracy, precision, recall, and F1-score are used. 
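To make the testing and evaluation step above concrete, here is a minimal sketch (the labels are made up for illustration, not taken from any dataset in this repo) of how the metrics just named are computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and model predictions for a binary task
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print('Accuracy: ', accuracy_score(y_true, y_pred))   # fraction of all predictions that are correct
print('Precision:', precision_score(y_true, y_pred))  # of the predicted positives, how many are real
print('Recall:   ', recall_score(y_true, y_pred))     # of the real positives, how many were found
print('F1-score: ', f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```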
21 | 22 | ## Applications: 23 | - Classification has a wide range of applications, including spam detection in emails, sentiment analysis in text, disease diagnosis in healthcare, credit risk assessment in finance, image recognition, and much more. 24 | 25 | ## Process Overview: 26 | 27 | The steps in the classification process can be grouped into several distinct stages based on their related tasks and objectives. Here's a breakdown of the process into different stages: 28 | 29 | ![Machine Learning Engineering Classification](../images/ozkary-machine-learning-classification.png "Machine Learning Engineering Classification") 30 | 31 | ### Stage 1: **Data Preprocessing** 32 | - Start 33 | - Data Preprocessing 34 | - Handle missing values 35 | - Perform feature engineering (if needed) 36 | - Encode categorical features 37 | - Split the data into training and testing sets 38 | 39 | ### Stage 2: **Model Development** 40 | - Model Selection 41 | - Choose an appropriate classification algorithm (e.g., Logistic Regression, Decision Trees, etc.) 42 | - Model Training 43 | - Train the selected model using the training set 44 | 45 | ### Stage 3: **Model Evaluation and Optimization** 46 | - Model Evaluation 47 | - Evaluate the model's performance using validation/testing data 48 | - Calculate evaluation metrics (accuracy, precision, recall, etc.) 49 | - Hyperparameter Tuning 50 | - Optimize the model's hyperparameters for better performance 51 | 52 | ### Stage 4: **Deployment** 53 | - Final Model 54 | - Select the final, optimized model 55 | - Model Deployment 56 | - Deploy the model for making predictions on new, unseen data 57 | 58 | ### Stage 5: **Completion** 59 | - End (Classification process is complete) 60 | 61 | This grouping helps in organizing and understanding the workflow of the classification process more effectively. Each stage focuses on specific tasks related to data preparation, model development, evaluation, and deployment, contributing to the overall classification process. 62 | 63 | Classification plays a vital role in various real-world scenarios, aiding in decision-making and providing valuable insights based on the analysis of different classes or categories. 64 | 65 | ## Use Case with Vehicle Data 66 | 67 | ### 1. **Understanding the Dataset:** 68 | Start by exploring and understanding the dataset. Familiarize yourself with the features, their data types, distributions, and the target variable (in this case, the vehicle category we want to predict). 69 | 70 | ### 2. **Data Preprocessing:** 71 | - **Handling Missing Values:** Check for missing or null values in the dataset and decide on a strategy to handle them, such as imputation or removal. 72 | - **Feature Engineering:** Extract relevant features from the given ones and create any additional features that could enhance predictive performance. 73 | - **Categorical Encoding:** Encode categorical features like "Make," "Model," and "Transmission Type" into numerical values using techniques like one-hot encoding or label encoding. 74 | - **Data Split:** Split the dataset into training (60%), validation (20%), and testing (20%) sets. 75 | 76 | ### 3. **Model Selection:** 77 | Choose an appropriate classification algorithm based on the problem. Here, we'll use Logistic Regression for classification. 78 | 79 | ### 4. **Model Training:** 80 | - Fit the Logistic Regression model on the training dataset. 81 | 82 | ### 5. 
**Model Evaluation:** 83 | - Evaluate the model's performance on the validation set using appropriate evaluation metrics for classification. 84 | 85 | ### 6. **Hyperparameter Tuning:** 86 | - Optimize the model's hyperparameters to achieve better performance. 87 | 88 | ### 7. **Model Deployment:** 89 | - Once you're satisfied with the model's performance, deploy it to make predictions on new, unseen data. 90 | 91 | ### Example Code (using Python and scikit-learn): 92 | 93 | ```python 94 | import pandas as pd 95 | from sklearn.model_selection import train_test_split 96 | from sklearn.linear_model import LogisticRegression 97 | from sklearn.metrics import accuracy_score, classification_report 98 | 99 | # Load the dataset 100 | data = pd.read_csv('vehicle_data.csv') 101 | 102 | # Data preprocessing (handle missing values, feature engineering, categorical encoding) 103 | 104 | # Split the dataset into features (X) and target (y) 105 | X = data.drop('VehicleCategory', axis=1) # Features 106 | y = data['VehicleCategory'] # Target 107 | 108 | # Split the data into training (60%), validation (20%), and testing (20%) sets 109 | X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42) 110 | X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) 111 | 112 | # Model selection and training 113 | model = LogisticRegression(random_state=42) 114 | model.fit(X_train, y_train) 115 | 116 | # Model evaluation on the validation set 117 | y_val_pred = model.predict(X_val) 118 | val_accuracy = accuracy_score(y_val, y_val_pred) 119 | print('Validation Accuracy:', val_accuracy) 120 | print('Validation Classification Report:\n', classification_report(y_val, y_val_pred)) 121 | 122 | # Hyperparameter tuning (if needed) 123 | 124 | # Model evaluation on the test set 125 | y_test_pred = model.predict(X_test) 126 | test_accuracy = accuracy_score(y_test, y_test_pred) 127 | print('Test Accuracy:', test_accuracy) 128 | print('Test Classification Report:\n', classification_report(y_test, y_test_pred)) 129 | ``` 130 | 131 | This example uses Logistic Regression for the classification task and follows a 60%/20%/20% data split distribution. We train the model, evaluate it on the validation set, and then evaluate the final performance on the test set. Adjust the code and model based on your dataset and specific classification requirements. 132 | 133 | ## Use Case Summary 134 | 135 | In the last code examples, we're performing a classification task using Logistic Regression, a commonly used algorithm for binary and multiclass classification. 136 | 137 | ### Goal of the Classification Task: 138 | 139 | The goal of this classification task is to predict the category or class of vehicles based on their features. Each vehicle in the dataset has associated features such as make, model, year, engine HP, engine cylinders, etc. The target variable (what we want to predict) is the category of the vehicle, for instance, "sedan," "SUV," "truck," etc. 140 | 141 | ### Expected Result: 142 | 143 | Given the features of a vehicle (e.g., make, model, year, etc.), the trained Logistic Regression model will predict the class or category to which the vehicle belongs. For example, if the features of a given vehicle correspond to those of an "SUV," the model should predict the class label as "SUV." 144 | 145 | ### Evaluation: 146 | 147 | We split the data into training, validation, and test sets. 
We train the Logistic Regression model using the training set, validate its performance on the validation set, and then evaluate its final performance on the test set using evaluation metrics such as accuracy, precision, recall, and F1-score. These metrics will provide insights into how well the model can predict the correct vehicle categories. 148 | 149 | ### Practical Use: 150 | 151 | In real-world applications, this classification model could be used by various stakeholders, such as dealerships, insurance companies, or consumers, to automatically categorize vehicles based on their features, aiding in inventory management, insurance assessments, or buying decisions. -------------------------------------------------------------------------------- /04-evaluation/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning (ML) Evaluation 2 | 3 | Evaluating a machine learning model is a critical step to ensure its performance and reliability in making predictions or classifications. The evaluation process helps you understand how well your model generalizes to unseen data and whether it's meeting the desired objectives. Here's an overview of the evaluation process and its purpose: 4 | 5 | ### Evaluation Process: 6 | 7 | 1. **Data Splitting:** 8 | - **Purpose:** Divide the dataset into training and testing/validation sets. 9 | - **Explanation:** This helps in training the model on one set of data (training) and assessing its performance on another (testing/validation) to mimic real-world scenarios. 10 | 11 | 2. **Model Training:** 12 | - **Purpose:** Train the machine learning model using the training set. 13 | - **Explanation:** The model learns patterns and relationships from the training data to make predictions or classifications. 14 | 15 | 3. **Model Validation:** 16 | - **Purpose:** Validate the model's performance on a separate dataset (validation set). 17 | - **Explanation:** Helps fine-tune the model's hyperparameters and check for overfitting or underfitting. 18 | 19 | 4. **Model Testing:** 20 | - **Purpose:** Assess the model's performance on a completely unseen dataset (test set). 21 | - **Explanation:** Gives a final evaluation of the model's capability to generalize to new, unseen data. 22 | 23 | 5. **Evaluation Metrics Calculation:** 24 | - **Purpose:** Calculate various metrics to assess the model's performance. 25 | - **Explanation:** Metrics like accuracy, precision, recall, F1-score, mean squared error (MSE), root mean squared error (RMSE), etc., help quantify how well the model is performing. 26 | 27 | ### Purpose of Evaluation: 28 | 29 | 1. **Assess Model Performance:** 30 | - Understand how well the model performs on unseen data. 31 | 32 | 2. **Identify Overfitting or Underfitting:** 33 | - Determine if the model is too complex (overfitting) or too simple (underfitting) for the data. 34 | 35 | 3. **Hyperparameter Tuning:** 36 | - Adjust model settings (hyperparameters) to achieve better performance. 37 | 38 | 4. **Select the Best Model:** 39 | - Compare multiple models to choose the most effective one for the specific task. 40 | 41 | 5. **Optimize for Objectives:** 42 | - Optimize the model to meet specific goals (e.g., accuracy, precision, etc.). 43 | 44 | 6. **Ensure Generalization:** 45 | - Confirm that the model generalizes well to new, unseen data. 
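One quick way to act on points 1 and 2 above is to compare the model's score on the training set with its score on a held-out test set; a large gap suggests overfitting. A minimal sketch, assuming a scikit-learn classifier and a synthetic dataset used purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used here only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree tends to memorize the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'Train accuracy: {train_acc:.3f}')  # often near 1.0 for a deep tree
print(f'Test accuracy:  {test_acc:.3f}')   # noticeably lower when overfitting
```

A training score near 1.0 paired with a much lower test score is the classic overfitting signature; constraining the model (e.g., limiting `max_depth`) usually narrows the gap.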
46 | 47 | #### ROC AUC 48 | 49 | ROC AUC (Receiver Operating Characteristic Area Under the Curve) is a performance metric used to evaluate the classification models, particularly in binary classification problems. It's a graphical representation of the model's ability to distinguish between the positive and negative classes by varying the classification threshold. 50 | 51 | Here's a breakdown of the components: 52 | 53 | - **ROC Curve**: The ROC curve is a graphical plot that illustrates the model's true positive rate (sensitivity) against the false positive rate (1 - specificity) for different classification thresholds. Each point on the ROC curve represents a sensitivity-specificity pair corresponding to a particular threshold. 54 | 55 | - **AUC Score**: The AUC score is the area under the ROC curve. It quantifies the overall performance of the model across all possible classification thresholds. A higher AUC score indicates better model performance, with a score of 1 representing a perfect model and a score of 0.5 representing a random guess. 56 | 57 | The ROC AUC provides insights into how well the model can discriminate between the positive and negative classes. A higher AUC indicates that the model has a better ability to distinguish between the two classes, making it a popular evaluation metric in binary classification tasks. 58 | 59 | In summary, ROC AUC is a widely used metric that condenses the ROC curve's information into a single score, providing a convenient way to assess the model's performance in binary classification problems. 60 | 61 | #### Area Under the Curve AUC 62 | 63 | By systematically evaluating a machine learning model, you gain insights into its strengths, weaknesses, and areas for improvement, ultimately leading to more robust and effective models for solving real-world problems. 64 | 65 | The Area Under the Curve (AUC) score is a metric used to evaluate the performance of a binary classification model. It measures the model's ability to distinguish between the positive and negative classes. 66 | 67 | In a binary classification problem, you have a positive class (e.g., presence of a disease) and a negative class (e.g., absence of a disease). The AUC score quantifies the model's ability to rank or score examples from the positive class higher than examples from the negative class. 68 | 69 | The AUC score is particularly useful when the dataset is imbalanced, meaning there's a significant difference in the number of examples between the positive and negative classes. 70 | 71 | Here's a brief explanation of how the AUC score is interpreted: 72 | 73 | - **AUC = 1**: The model perfectly distinguishes between the positive and negative classes, i.e., it ranks all positives higher than all negatives. 74 | 75 | - **AUC = 0.5**: The model performs no better than random chance, indicating that it's unable to distinguish between the classes. 76 | 77 | - **AUC < 0.5**: The model is performing worse than random chance, essentially reversing the labels. 78 | 79 | - **0.5 < AUC < 1**: The model is making some useful distinctions between the classes, with a higher AUC indicating better performance. 80 | 81 | In summary, AUC is a valuable metric to evaluate the classification model's performance, especially in imbalanced datasets, by assessing its ability to correctly rank examples from different classes. 
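As a concrete illustration of the AUC interpretation above, here is a minimal sketch that scores predicted probabilities against true labels (the numbers are made up for the example):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

# Single-number summary: area under the ROC curve
auc = roc_auc_score(y_true, y_scores)
print(f'ROC AUC: {auc:.3f}')

# The underlying curve: true positive rate vs. false positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f'threshold={th:.2f} -> FPR={f:.2f}, TPR={t:.2f}')
```

Note that `roc_auc_score` expects scores or probabilities, not hard class labels; passing thresholded 0/1 predictions throws away exactly the ranking information the metric is designed to measure.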
-------------------------------------------------------------------------------- /05-deployment/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use the base image 2 | FROM svizor/zoomcamp-model:3.10.12-slim 3 | 4 | # Set the working directory 5 | WORKDIR /app 6 | 7 | # Copy the Pipenv files to the container 8 | COPY Pipfile Pipfile.lock /app/ 9 | 10 | # Install pipenv and dependencies 11 | RUN pip install pipenv 12 | RUN pipenv install --system --deploy 13 | 14 | # Copy the Flask script to the container 15 | COPY app.py /app/ 16 | 17 | # Expose the port your Flask app runs on 18 | EXPOSE 8000 19 | 20 | # Run the Flask app with Gunicorn 21 | CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000", "--workers", "4"] 22 | -------------------------------------------------------------------------------- /05-deployment/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | scikit-learn = "*" 8 | flask = "*" 9 | gunicorn = "*" 10 | 11 | [dev-packages] 12 | 13 | [requires] 14 | python_version = "3.8" 15 | -------------------------------------------------------------------------------- /05-deployment/app.py: -------------------------------------------------------------------------------- 1 | # app.py - Your Flask application 2 | 3 | from flask import Flask, request, jsonify 4 | import pickle 5 | import os 6 | 7 | app = Flask(__name__) 8 | 9 | # Load the DictVectorizer 10 | with open('dv.bin', 'rb') as f: 11 | dv = pickle.load(f) 12 | 13 | file_path = 'model1.bin' if os.path.exists('model1.bin') else 'model2.bin' 14 | 15 | # Load the Logistic Regression model 16 | with open(file_path, 'rb') as f: 17 | model = pickle.load(f) 18 | 19 | # Define the prediction endpoint and route 20 | @app.route('/predict', methods=['POST']) 21 | def predict(): 22 | # get the json payload 23 | data = request.get_json() 24 | print("data",data) 25 | 26 | # Transform the new data using the DictVectorizer 27 | transformed_data = dv.transform([data]) 28 | 29 | # Process the data, make predictions using the model, and return the results 30 | probabilities = model.predict_proba(transformed_data) 31 | 32 | # break down the probabilities into yes or no score 33 | no_score, yes_score = probabilities[0] 34 | 35 | # get the class labels from the model 36 | no_label, yes_label = model.classes_ 37 | return jsonify({'yes': yes_score, 'no': no_score}) 38 | 39 | 40 | # load the app 41 | if __name__ == '__main__': 42 | app.run(host='0.0.0.0', port=8000) 43 | -------------------------------------------------------------------------------- /05-deployment/dv.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/05-deployment/dv.bin -------------------------------------------------------------------------------- /05-deployment/model1.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/05-deployment/model1.bin -------------------------------------------------------------------------------- /06-trees/ozkary-decision-tree.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/06-trees/ozkary-decision-tree.png -------------------------------------------------------------------------------- /07-midterm-project/README.md: -------------------------------------------------------------------------------- 1 | # Mid-Term Project 2 | 3 | Check out this project under the project folder: 4 | 5 | [Heart Disease Risk](../projects/heart-disease-risk/) -------------------------------------------------------------------------------- /08-deep-learning/images/cnn_metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/08-deep-learning/images/cnn_metrics.png -------------------------------------------------------------------------------- /08-deep-learning/images/cnn_metrics_augmented.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/08-deep-learning/images/cnn_metrics_augmented.png -------------------------------------------------------------------------------- /08-deep-learning/images/ozkary-convolutional-neural-network-bk.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/08-deep-learning/images/ozkary-convolutional-neural-network-bk.png -------------------------------------------------------------------------------- /08-deep-learning/images/ozkary-convolutional-neural-network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/08-deep-learning/images/ozkary-convolutional-neural-network.png -------------------------------------------------------------------------------- /09-serverless/.gitignore: -------------------------------------------------------------------------------- 1 | bees-*.h5 -------------------------------------------------------------------------------- /09-serverless/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use the base image 2 | FROM agrigorev/zoomcamp-bees-wasps:v2 3 | 4 | #define a working directory 5 | WORKDIR . 6 | 7 | # Set the working directory azure functions 8 | # WORKDIR /home/site/wwwroot 9 | 10 | # ENV AzureWebJobsScriptRoot=/home/site/wwwroot \ 11 | # AzureFunctionsJobHost__Logging__Console__IsEnabled=true 12 | 13 | # Copy the script to the container 14 | # COPY *.py /home/site/wwwroot 15 | COPY main.py . 16 | COPY img_ai ./img_ai/ 17 | 18 | # Copy the Pipenv files to the container 19 | # COPY Pipfile Pipfile.lock ./ 20 | 21 | # Install pipenv and dependencies 22 | # RUN pip install pipenv 23 | # RUN pipenv install --system --deploy 24 | RUN pip install numpy 25 | RUN pip install Pillow 26 | RUN pip install https://github.com/alexeygrigorev/tflite-aws-lambda/raw/main/tflite/tflite_runtime-2.14.0-cp310-cp310-linux_x86_64.whl 27 | 28 | # use as the entry point that should be called by the serverless functions 29 | CMD ["main.main"] 30 | 31 | # build the image 32 | # docker build -t ozkary/bees-wasps:v1 . 
33 | # run from the container 34 | # docker run -p 8080:8080 -it ozkary/bees-wasps:v1 35 | # check files in container 36 | # docker exec -it ozkary/bees-wasps:v1 ls . 37 | -------------------------------------------------------------------------------- /09-serverless/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | pillow = "*" 8 | tflite-runtime = "*" 9 | numpy = "*" 10 | 11 | [dev-packages] 12 | 13 | [requires] 14 | python_version = "3.8" 15 | -------------------------------------------------------------------------------- /09-serverless/Pipfile.lock: -------------------------------------------------------------------------------- 1 | { 2 | "_meta": { 3 | "hash": { 4 | "sha256": "58ce7f10a6c38593542b7bc5c78ba05a455726fe79e51357cebe942b0a130608" 5 | }, 6 | "pipfile-spec": 6, 7 | "requires": { 8 | "python_version": "3.8" 9 | }, 10 | "sources": [ 11 | { 12 | "name": "pypi", 13 | "url": "https://pypi.org/simple", 14 | "verify_ssl": true 15 | } 16 | ] 17 | }, 18 | "default": { 19 | "numpy": { 20 | "hashes": [ 21 | "sha256:04640dab83f7c6c85abf9cd729c5b65f1ebd0ccf9de90b270cd61935eef0197f", 22 | "sha256:1452241c290f3e2a312c137a9999cdbf63f78864d63c79039bda65ee86943f61", 23 | "sha256:222e40d0e2548690405b0b3c7b21d1169117391c2e82c378467ef9ab4c8f0da7", 24 | "sha256:2541312fbf09977f3b3ad449c4e5f4bb55d0dbf79226d7724211acc905049400", 25 | "sha256:31f13e25b4e304632a4619d0e0777662c2ffea99fcae2029556b17d8ff958aef", 26 | "sha256:4602244f345453db537be5314d3983dbf5834a9701b7723ec28923e2889e0bb2", 27 | "sha256:4979217d7de511a8d57f4b4b5b2b965f707768440c17cb70fbf254c4b225238d", 28 | "sha256:4c21decb6ea94057331e111a5bed9a79d335658c27ce2adb580fb4d54f2ad9bc", 29 | "sha256:6620c0acd41dbcb368610bb2f4d83145674040025e5536954782467100aa8835", 30 | "sha256:692f2e0f55794943c5bfff12b3f56f99af76f902fc47487bdfe97856de51a706", 31 | "sha256:7215847ce88a85ce39baf9e89070cb860c98fdddacbaa6c0da3ffb31b3350bd5", 32 | "sha256:79fc682a374c4a8ed08b331bef9c5f582585d1048fa6d80bc6c35bc384eee9b4", 33 | "sha256:7ffe43c74893dbf38c2b0a1f5428760a1a9c98285553c89e12d70a96a7f3a4d6", 34 | "sha256:80f5e3a4e498641401868df4208b74581206afbee7cf7b8329daae82676d9463", 35 | "sha256:95f7ac6540e95bc440ad77f56e520da5bf877f87dca58bd095288dce8940532a", 36 | "sha256:9667575fb6d13c95f1b36aca12c5ee3356bf001b714fc354eb5465ce1609e62f", 37 | "sha256:a5425b114831d1e77e4b5d812b69d11d962e104095a5b9c3b641a218abcc050e", 38 | "sha256:b4bea75e47d9586d31e892a7401f76e909712a0fd510f58f5337bea9572c571e", 39 | "sha256:b7b1fc9864d7d39e28f41d089bfd6353cb5f27ecd9905348c24187a768c79694", 40 | "sha256:befe2bf740fd8373cf56149a5c23a0f601e82869598d41f8e188a0e9869926f8", 41 | "sha256:c0bfb52d2169d58c1cdb8cc1f16989101639b34c7d3ce60ed70b19c63eba0b64", 42 | "sha256:d11efb4dbecbdf22508d55e48d9c8384db795e1b7b51ea735289ff96613ff74d", 43 | "sha256:dd80e219fd4c71fc3699fc1dadac5dcf4fd882bfc6f7ec53d30fa197b8ee22dc", 44 | "sha256:e2926dac25b313635e4d6cf4dc4e51c8c0ebfed60b801c799ffc4c32bf3d1254", 45 | "sha256:e98f220aa76ca2a977fe435f5b04d7b3470c0a2e6312907b37ba6068f26787f2", 46 | "sha256:ed094d4f0c177b1b8e7aa9cba7d6ceed51c0e569a5318ac0ca9a090680a6a1b1", 47 | "sha256:f136bab9c2cfd8da131132c2cf6cc27331dd6fae65f95f69dcd4ae3c3639c810", 48 | "sha256:f3a86ed21e4f87050382c7bc96571755193c4c1392490744ac73d660e8f564a9" 49 | ], 50 | "index": "pypi", 51 | "markers": "python_version >= '3.8'", 52 | "version": "==1.24.4" 53 | }, 54 | "pillow": { 55 | 
"hashes": [ 56 | "sha256:00f438bb841382b15d7deb9a05cc946ee0f2c352653c7aa659e75e592f6fa17d", 57 | "sha256:0248f86b3ea061e67817c47ecbe82c23f9dd5d5226200eb9090b3873d3ca32de", 58 | "sha256:04f6f6149f266a100374ca3cc368b67fb27c4af9f1cc8cb6306d849dcdf12616", 59 | "sha256:062a1610e3bc258bff2328ec43f34244fcec972ee0717200cb1425214fe5b839", 60 | "sha256:0a026c188be3b443916179f5d04548092e253beb0c3e2ee0a4e2cdad72f66099", 61 | "sha256:0f7c276c05a9767e877a0b4c5050c8bee6a6d960d7f0c11ebda6b99746068c2a", 62 | "sha256:1a8413794b4ad9719346cd9306118450b7b00d9a15846451549314a58ac42219", 63 | "sha256:1ab05f3db77e98f93964697c8efc49c7954b08dd61cff526b7f2531a22410106", 64 | "sha256:1c3ac5423c8c1da5928aa12c6e258921956757d976405e9467c5f39d1d577a4b", 65 | "sha256:1c41d960babf951e01a49c9746f92c5a7e0d939d1652d7ba30f6b3090f27e412", 66 | "sha256:1fafabe50a6977ac70dfe829b2d5735fd54e190ab55259ec8aea4aaea412fa0b", 67 | "sha256:1fb29c07478e6c06a46b867e43b0bcdb241b44cc52be9bc25ce5944eed4648e7", 68 | "sha256:24fadc71218ad2b8ffe437b54876c9382b4a29e030a05a9879f615091f42ffc2", 69 | "sha256:2cdc65a46e74514ce742c2013cd4a2d12e8553e3a2563c64879f7c7e4d28bce7", 70 | "sha256:2ef6721c97894a7aa77723740a09547197533146fba8355e86d6d9a4a1056b14", 71 | "sha256:3b834f4b16173e5b92ab6566f0473bfb09f939ba14b23b8da1f54fa63e4b623f", 72 | "sha256:3d929a19f5469b3f4df33a3df2983db070ebb2088a1e145e18facbc28cae5b27", 73 | "sha256:41f67248d92a5e0a2076d3517d8d4b1e41a97e2df10eb8f93106c89107f38b57", 74 | "sha256:47e5bf85b80abc03be7455c95b6d6e4896a62f6541c1f2ce77a7d2bb832af262", 75 | "sha256:4d0152565c6aa6ebbfb1e5d8624140a440f2b99bf7afaafbdbf6430426497f28", 76 | "sha256:50d08cd0a2ecd2a8657bd3d82c71efd5a58edb04d9308185d66c3a5a5bed9610", 77 | "sha256:61f1a9d247317fa08a308daaa8ee7b3f760ab1809ca2da14ecc88ae4257d6172", 78 | "sha256:6932a7652464746fcb484f7fc3618e6503d2066d853f68a4bd97193a3996e273", 79 | "sha256:7a7e3daa202beb61821c06d2517428e8e7c1aab08943e92ec9e5755c2fc9ba5e", 80 | "sha256:7dbaa3c7de82ef37e7708521be41db5565004258ca76945ad74a8e998c30af8d", 81 | "sha256:7df5608bc38bd37ef585ae9c38c9cd46d7c81498f086915b0f97255ea60c2818", 82 | "sha256:806abdd8249ba3953c33742506fe414880bad78ac25cc9a9b1c6ae97bedd573f", 83 | "sha256:883f216eac8712b83a63f41b76ddfb7b2afab1b74abbb413c5df6680f071a6b9", 84 | "sha256:912e3812a1dbbc834da2b32299b124b5ddcb664ed354916fd1ed6f193f0e2d01", 85 | "sha256:937bdc5a7f5343d1c97dc98149a0be7eb9704e937fe3dc7140e229ae4fc572a7", 86 | "sha256:9882a7451c680c12f232a422730f986a1fcd808da0fd428f08b671237237d651", 87 | "sha256:9a92109192b360634a4489c0c756364c0c3a2992906752165ecb50544c251312", 88 | "sha256:9d7bc666bd8c5a4225e7ac71f2f9d12466ec555e89092728ea0f5c0c2422ea80", 89 | "sha256:a5f63b5a68daedc54c7c3464508d8c12075e56dcfbd42f8c1bf40169061ae666", 90 | "sha256:a646e48de237d860c36e0db37ecaecaa3619e6f3e9d5319e527ccbc8151df061", 91 | "sha256:a89b8312d51715b510a4fe9fc13686283f376cfd5abca8cd1c65e4c76e21081b", 92 | "sha256:a92386125e9ee90381c3369f57a2a50fa9e6aa8b1cf1d9c4b200d41a7dd8e992", 93 | "sha256:ae88931f93214777c7a3aa0a8f92a683f83ecde27f65a45f95f22d289a69e593", 94 | "sha256:afc8eef765d948543a4775f00b7b8c079b3321d6b675dde0d02afa2ee23000b4", 95 | "sha256:b0eb01ca85b2361b09480784a7931fc648ed8b7836f01fb9241141b968feb1db", 96 | "sha256:b1c25762197144e211efb5f4e8ad656f36c8d214d390585d1d21281f46d556ba", 97 | "sha256:b4005fee46ed9be0b8fb42be0c20e79411533d1fd58edabebc0dd24626882cfd", 98 | "sha256:b920e4d028f6442bea9a75b7491c063f0b9a3972520731ed26c83e254302eb1e", 99 | "sha256:baada14941c83079bf84c037e2d8b7506ce201e92e3d2fa0d1303507a8538212", 100 | 
"sha256:bb40c011447712d2e19cc261c82655f75f32cb724788df315ed992a4d65696bb", 101 | "sha256:c0949b55eb607898e28eaccb525ab104b2d86542a85c74baf3a6dc24002edec2", 102 | "sha256:c9aeea7b63edb7884b031a35305629a7593272b54f429a9869a4f63a1bf04c34", 103 | "sha256:cfe96560c6ce2f4c07d6647af2d0f3c54cc33289894ebd88cfbb3bcd5391e256", 104 | "sha256:d27b5997bdd2eb9fb199982bb7eb6164db0426904020dc38c10203187ae2ff2f", 105 | "sha256:d921bc90b1defa55c9917ca6b6b71430e4286fc9e44c55ead78ca1a9f9eba5f2", 106 | "sha256:e6bf8de6c36ed96c86ea3b6e1d5273c53f46ef518a062464cd7ef5dd2cf92e38", 107 | "sha256:eaed6977fa73408b7b8a24e8b14e59e1668cfc0f4c40193ea7ced8e210adf996", 108 | "sha256:fa1d323703cfdac2036af05191b969b910d8f115cf53093125e4058f62012c9a", 109 | "sha256:fe1e26e1ffc38be097f0ba1d0d07fcade2bcfd1d023cda5b29935ae8052bd793" 110 | ], 111 | "index": "pypi", 112 | "markers": "python_version >= '3.8'", 113 | "version": "==10.1.0" 114 | }, 115 | "tflite-runtime": { 116 | "hashes": [ 117 | "sha256:195ab752e7e57329a68e54dd3dd5439fad888b9bff1be0f0dc042a3237a90e4d", 118 | "sha256:437167fe3d8b12f50f5d694da8f45d268ab84a495e24c3dd810e02e1012125de", 119 | "sha256:4aa740210a0fd9e4db4a46e9778914846b136e161525681b41575ca4896158fb", 120 | "sha256:79d8e17f68cc940df7e68a177b22dda60fcffba195fb9dd908d03724d65fd118", 121 | "sha256:7fe33f763263d1ff2733a09945a7547ab063d8bc311fd2a1be8144d850016ad3", 122 | "sha256:9f965054467f7890e678943858c6ac76a5197b17f61b48dcbaaba0af41d541a7", 123 | "sha256:bb11df4283e281cd609c621ac9470ad0cb5674408593272d7593a2c6bde8a808", 124 | "sha256:be198b7dc4401204be54a15884d9e336389790eb707439524540f5a9329fdd02", 125 | "sha256:c4e66a74165b18089c86788400af19fa551768ac782d231a9beae2f6434f7949", 126 | "sha256:ce9fa5d770a9725c746dcbf6f59f3178233b3759f09982e8b2db8d2234c333b0", 127 | "sha256:d38c6885f5e9673c11a61ccec5cad7c032ab97340718d26b17794137f398b780", 128 | "sha256:eca7672adca32727bbf5c0f1caf398fc17bbe222f2a684c7a2caea6fc6767203" 129 | ], 130 | "index": "pypi", 131 | "version": "==2.14.0" 132 | } 133 | }, 134 | "develop": {} 135 | } 136 | -------------------------------------------------------------------------------- /09-serverless/img_ai/__init__.py: -------------------------------------------------------------------------------- 1 | # your_package/__init__.py 2 | from .__version__ import __version__ 3 | from .bees_wasps import BeesWaspsModel 4 | 5 | print(f"Initializing img_ai {__version__}") -------------------------------------------------------------------------------- /09-serverless/img_ai/__version__.py: -------------------------------------------------------------------------------- 1 | # your_package/__version__.py 2 | 3 | __version__ = "0.1.0" # Replace with your desired version number 4 | -------------------------------------------------------------------------------- /09-serverless/img_ai/bees_wasps.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # # Machine Learning - Serverless Hosting 5 | # 6 | # Machine Learning models can be hosted on server less functions. To host this model, we must consider the size of the packages and their security settings. 
7 | 8 | import tflite_runtime.interpreter as tflite 9 | from PIL import Image 10 | import numpy as np 11 | 12 | from io import BytesIO 13 | from urllib import request 14 | 15 | # define a class for the bees vs wasps model inference 16 | class BeesWaspsModel: 17 | def __init__(self, path): 18 | # load the lite model and the input/output details 19 | self.interpreter, self.input_details, self.output_details = self.load_lite_model(path) 20 | 21 | def load_lite_model(self, path): 22 | """ 23 | Load the TensorFlow Lite model and allocate tensors. 24 | """ 25 | # Load the TFLite model and allocate tensors. 26 | interpreter = tflite.Interpreter(model_path=path) 27 | interpreter.allocate_tensors() 28 | 29 | # Get input and output tensors. 30 | input_details = interpreter.get_input_details() 31 | output_details = interpreter.get_output_details() 32 | 33 | return interpreter, input_details, output_details 34 | 35 | def download_image(self, url): 36 | """ 37 | download the image from the url 38 | """ 39 | with request.urlopen(url) as resp: 40 | buffer = resp.read() 41 | 42 | stream = BytesIO(buffer) 43 | img = Image.open(stream) 44 | 45 | return img 46 | 47 | def prepare_image(self, img, target_size): 48 | """ 49 | Resize the image to target_size, e.g. (150,150) 50 | """ 51 | if img.mode != 'RGB': 52 | img = img.convert('RGB') 53 | img = img.resize(target_size, Image.NEAREST) 54 | return img 55 | 56 | def preprocess_image(self, img): 57 | """ 58 | convert the image to a numpy array and normalize it 59 | """ 60 | # convert to numpy array 61 | img = np.array(img) 62 | 63 | # convert to float32 so the normalization below yields fractional values and matches the model input type 64 | img = img.astype('float32') 65 | 66 | # normalize to the range 0-1 67 | img /= 255 68 | 69 | return img 70 | 71 | def download_and_preprocess_image(self, url, target_size): 72 | """ 73 | Download the image from the url, resize it to target_size and return a numpy array 74 | """ 75 | 76 | img_stream = self.download_image(url) 77 | img = self.prepare_image(img_stream, target_size) 78 | img_normalized = self.preprocess_image(img) 79 | 80 | return img_normalized 81 | 82 | def img_inference(self, img_normalized): 83 | """ 84 | Run the inference on a normalized image 85 | 86 | Returns the raw output tensor values as a list 87 | """ 88 | 89 | # set the input tensor with the normalized image 90 | self.interpreter.set_tensor(self.input_details[0]['index'], [img_normalized]) 91 | 92 | # run the inference 93 | self.interpreter.invoke() 94 | 95 | # get the output tensor 96 | output_data = self.interpreter.get_tensor(self.output_details[0]['index']) 97 | 98 | # # get the result value 99 | # result = round(output_data[0][0],3) 100 | 101 | # # print the output 102 | # print('Tensor Output', result) 103 | 104 | return output_data[0].tolist() 105 | 106 | 107 | -------------------------------------------------------------------------------- /09-serverless/main.py: -------------------------------------------------------------------------------- 1 | # create a function 'predict' that uses the BeesWaspsModel class from the bees_wasps module to download and preprocess the image 2 | from img_ai.__version__ import __version__ 3 | from img_ai.bees_wasps import BeesWaspsModel 4 | 5 | def predict(url): 6 | 7 | # from local folder 8 | # bees_wasps_model = BeesWaspsModel('./models/bees-wasps.tflite') 9 | 10 | # from the docker image 11 | bees_wasps_model = BeesWaspsModel('bees-wasps-v2.tflite') 12 | 13 | # download the image 14 | img_normalized = bees_wasps_model.download_and_preprocess_image(url,
(150,150)) 15 | 16 | # run the inference on the preprocessed image 17 | result = bees_wasps_model.img_inference(img_normalized) 18 | 19 | return result 20 | 21 | def main(event, context): 22 | url = event['url'] 23 | result = predict(url) 24 | return result 25 | 26 | -------------------------------------------------------------------------------- /10-kubernetes/src/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM svizor/zoomcamp-model:3.10.12-slim 2 | 3 | RUN pip install pipenv 4 | COPY ["Pipfile", "Pipfile.lock", "./"] 5 | RUN pipenv install --system --deploy 6 | 7 | COPY ["q6_predict.py", "./"] 8 | EXPOSE 9696 9 | ENTRYPOINT ["waitress-serve", "--listen=0.0.0.0:9696", "q6_predict:app"] -------------------------------------------------------------------------------- /10-kubernetes/src/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | scikit-learn = "==1.3.1" 8 | flask = "*" 9 | waitress = "*" 10 | 11 | [dev-packages] 12 | 13 | [requires] 14 | python_version = "3.10" 15 | -------------------------------------------------------------------------------- /10-kubernetes/src/deployment.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: apps/v1 2 | kind: Deployment 3 | metadata: 4 | name: credit 5 | spec: 6 | selector: 7 | matchLabels: 8 | app: credit 9 | replicas: 1 10 | template: 11 | metadata: 12 | labels: 13 | app: credit 14 | spec: 15 | containers: 16 | - name: credit 17 | image: zoomcamp-model:hw10 18 | resources: 19 | requests: 20 | memory: "64Mi" 21 | cpu: "100m" 22 | limits: 23 | memory: "128Mi" # Adjust based on your application's requirements 24 | cpu: "200m" # Adjust based on your application's requirements 25 | ports: 26 | - containerPort: 9696 27 | -------------------------------------------------------------------------------- /10-kubernetes/src/dv.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/10-kubernetes/src/dv.bin -------------------------------------------------------------------------------- /10-kubernetes/src/hpa.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: autoscaling/v2 2 | kind: HorizontalPodAutoscaler 3 | metadata: 4 | name: credit-autoscaler 5 | spec: 6 | scaleTargetRef: 7 | apiVersion: apps/v1 8 | kind: Deployment 9 | name: credit 10 | minReplicas: 1 11 | maxReplicas: 3 12 | metrics: 13 | - type: Resource 14 | resource: 15 | name: cpu 16 | target: 17 | type: Utilization 18 | averageUtilization: 20 -------------------------------------------------------------------------------- /10-kubernetes/src/model1.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/10-kubernetes/src/model1.bin -------------------------------------------------------------------------------- /10-kubernetes/src/q3_test.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | 4 | def load(filename: str): 5 | with open(filename, 'rb') as f_in: 6 | return pickle.load(f_in) 7 | 8 | 9 | dv = load('dv.bin') 10 | model = load('model1.bin') 11 | 12 | client = {"job": "retired", "duration": 445, "poutcome":
"success"} 13 | 14 | X = dv.transform([client]) 15 | y_pred = model.predict_proba(X)[0, 1] 16 | 17 | print(y_pred) 18 | -------------------------------------------------------------------------------- /10-kubernetes/src/q4_predict.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | from flask import Flask 4 | from flask import request 5 | from flask import jsonify 6 | 7 | 8 | def load(filename: str): 9 | with open(filename, 'rb') as f_in: 10 | return pickle.load(f_in) 11 | 12 | 13 | dv = load('dv.bin') 14 | model = load('model1.bin') 15 | 16 | app = Flask('get-credit') 17 | 18 | 19 | @app.route('/predict', methods=['POST']) 20 | def predict(): 21 | client = request.get_json() 22 | 23 | X = dv.transform([client]) 24 | y_pred = model.predict_proba(X)[0, 1] 25 | get_credit = y_pred >= 0.5 26 | 27 | result = { 28 | 'get_credit_probability': float(y_pred), 29 | 'get_credit': bool(get_credit) 30 | } 31 | 32 | return jsonify(result) 33 | 34 | 35 | if __name__ == "__main__": 36 | app.run(debug=True, host='0.0.0.0', port=9696) 37 | -------------------------------------------------------------------------------- /10-kubernetes/src/q4_test.py: -------------------------------------------------------------------------------- 1 | import requests 2 | 3 | 4 | url = "http://localhost:9696/predict" 5 | 6 | client = {"job": "unknown", "duration": 270, "poutcome": "failure"} 7 | response = requests.post(url, json=client).json() 8 | 9 | print(response) 10 | -------------------------------------------------------------------------------- /10-kubernetes/src/q6_predict.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | from flask import Flask 4 | from flask import request 5 | from flask import jsonify 6 | 7 | 8 | def load(filename: str): 9 | with open(filename, 'rb') as f_in: 10 | return pickle.load(f_in) 11 | 12 | 13 | dv = load('dv.bin') 14 | model = load('model2.bin') 15 | 16 | app = Flask('get-credit') 17 | 18 | 19 | @app.route('/predict', methods=['POST']) 20 | def predict(): 21 | client = request.get_json() 22 | 23 | X = dv.transform([client]) 24 | y_pred = model.predict_proba(X)[0, 1] 25 | get_credit = y_pred >= 0.5 26 | 27 | result = { 28 | 'get_credit_probability': float(y_pred), 29 | 'get_credit': bool(get_credit) 30 | } 31 | 32 | return jsonify(result) 33 | 34 | 35 | if __name__ == "__main__": 36 | app.run(debug=True, host='0.0.0.0', port=9696) 37 | -------------------------------------------------------------------------------- /10-kubernetes/src/q6_test.py: -------------------------------------------------------------------------------- 1 | import requests 2 | 3 | 4 | url = "http://localhost:9696/predict" 5 | 6 | client = {"job": "retired", "duration": 445, "poutcome": "success"} 7 | response = requests.post(url, json=client).json() 8 | 9 | print(response) 10 | -------------------------------------------------------------------------------- /10-kubernetes/src/q6_test_loop.py: -------------------------------------------------------------------------------- 1 | from asyncio import sleep 2 | import random 3 | import requests 4 | 5 | 6 | url = "http://localhost:9696/predict" 7 | client = {"job": "retired", "duration": 445, "poutcome": "success"} 8 | 9 | # get random values for retired, working, unemployed 10 | status = ["retired", "working", "unemployed"] 11 | 12 | 13 | while True: 14 | sleep(.1) 15 | duration = random.randint(400, 500) 16 | client["duration"] = duration 17 | 
client["job"] = random.choice(status) 18 | response = requests.post(url, json=client).json() 19 | print(response) 20 | -------------------------------------------------------------------------------- /10-kubernetes/src/service.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: v1 2 | kind: Service 3 | metadata: 4 | name: ozkary-credit-service 5 | spec: 6 | type: LoadBalancer 7 | selector: 8 | app: credit 9 | ports: 10 | - port: 80 11 | targetPort: 9696 -------------------------------------------------------------------------------- /11-kserve/README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning - KServe Hosting ML Models 2 | 3 | ## KServe Overview: 4 | 5 | **Purpose:** 6 | KServe aims to provide a standardized and scalable solution for deploying, serving, and managing machine learning models in Kubernetes clusters. Kubernetes is a container orchestration platform that allows for efficient deployment, scaling, and management of containerized applications. 7 | 8 | **Key Features:** 9 | 10 | 1. **Scalability:** 11 | - KServe allows for the efficient scaling of ML model serving based on demand. It can automatically scale the number of serving instances up or down to handle varying workloads. 12 | 13 | 2. **Model Versioning:** 14 | - KServe supports multiple versions of a model, enabling easy testing and deployment of new model versions. This facilitates gradual rollouts and A/B testing. 15 | 16 | 3. **Inference Acceleration:** 17 | - The framework often integrates with specialized hardware accelerators (such as GPUs) to optimize model inference, ensuring low-latency responses for real-time applications. 18 | 19 | 4. **Multi-framework Support:** 20 | - KServe is designed to support models built with various machine learning frameworks, providing flexibility for data scientists and developers to use their preferred tools. 21 | 22 | 5. **Monitoring and Logging:** 23 | - KServe typically includes monitoring and logging capabilities, allowing users to track the performance of deployed models, collect logs, and monitor resource usage. 24 | 25 | 6. **Integration with Kubernetes:** 26 | - Being designed for Kubernetes, KServe seamlessly integrates with other Kubernetes components and can leverage features like auto-scaling and resource management. 27 | 28 | 7. **Cloud-Native:** 29 | - KServe aligns with cloud-native principles, making it well-suited for deployment in cloud environments. 30 | 31 | **How to Use:** 32 | Users typically define a KServe custom resource (such as a KServe InferenceService) to deploy and manage a machine learning model. This resource specifies details like the model URI, serving runtime, and other configuration parameters. 
33 | -------------------------------------------------------------------------------- /11-kserve/iris_example.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: "serving.kserve.io/v1beta1" 2 | kind: "InferenceService" 3 | metadata: 4 | name: "sklearn-iris" 5 | spec: 6 | predictor: 7 | model: 8 | modelFormat: 9 | name: sklearn 10 | storageUri: "gs://kfserving-examples/models/sklearn/1.0/model" -------------------------------------------------------------------------------- /11-kserve/quick_install.sh: -------------------------------------------------------------------------------- 1 | set -e 2 | ############################################################ 3 | # Help # 4 | ############################################################ 5 | Help() 6 | { 7 | # Display Help 8 | echo "KServe quick install script." 9 | echo 10 | echo "Syntax: [-s|-r]" 11 | echo "options:" 12 | echo "s Serverless Mode." 13 | echo "r RawDeployment Mode." 14 | echo 15 | } 16 | 17 | deploymentMode=serverless 18 | while getopts ":hsr" option; do 19 | case $option in 20 | h) # display Help 21 | Help 22 | exit;; 23 | r) # skip knative install 24 | deploymentMode=kubernetes;; 25 | s) # install knative 26 | deploymentMode=serverless;; 27 | \?) # Invalid option 28 | echo "Error: Invalid option" 29 | exit;; 30 | esac 31 | done 32 | 33 | export ISTIO_VERSION=1.17.2 34 | export KNATIVE_SERVING_VERSION=knative-v1.10.1 35 | export KNATIVE_ISTIO_VERSION=knative-v1.10.0 36 | export KSERVE_VERSION=v0.11.2 37 | export CERT_MANAGER_VERSION=v1.3.0 38 | export SCRIPT_DIR="$( dirname -- "${BASH_SOURCE[0]}" )" 39 | 40 | KUBE_VERSION=$(kubectl version --short=true | grep "Server Version" | awk -F '.' '{print $2}') 41 | if [ ${KUBE_VERSION} -lt 24 ]; 42 | then 43 | echo "😱 install requires at least Kubernetes 1.24"; 44 | exit 1; 45 | fi 46 | 47 | curl -L https://istio.io/downloadIstio | sh - 48 | cd istio-${ISTIO_VERSION} 49 | 50 | # Create istio-system namespace 51 | cat << EOF | kubectl apply -f - 52 | apiVersion: v1 53 | kind: Namespace 54 | metadata: 55 | name: istio-system 56 | labels: 57 | istio-injection: disabled 58 | EOF 59 | 60 | cat << EOF > ./istio-minimal-operator.yaml 61 | apiVersion: install.istio.io/v1beta1 62 | kind: IstioOperator 63 | spec: 64 | values: 65 | global: 66 | proxy: 67 | autoInject: disabled 68 | useMCP: false 69 | 70 | meshConfig: 71 | accessLogFile: /dev/stdout 72 | 73 | components: 74 | ingressGateways: 75 | - name: istio-ingressgateway 76 | enabled: true 77 | k8s: 78 | podAnnotations: 79 | cluster-autoscaler.kubernetes.io/safe-to-evict: "true" 80 | pilot: 81 | enabled: true 82 | k8s: 83 | resources: 84 | requests: 85 | cpu: 200m 86 | memory: 200Mi 87 | podAnnotations: 88 | cluster-autoscaler.kubernetes.io/safe-to-evict: "true" 89 | env: 90 | - name: PILOT_ENABLE_CONFIG_DISTRIBUTION_TRACKING 91 | value: "false" 92 | EOF 93 | 94 | bin/istioctl manifest apply -f istio-minimal-operator.yaml -y; 95 | 96 | echo "😀 Successfully installed Istio" 97 | 98 | # Install Knative 99 | if [ $deploymentMode = serverless ]; then 100 | kubectl apply --filename https://github.com/knative/serving/releases/download/${KNATIVE_SERVING_VERSION}/serving-crds.yaml 101 | kubectl apply --filename https://github.com/knative/serving/releases/download/${KNATIVE_SERVING_VERSION}/serving-core.yaml 102 | kubectl apply --filename https://github.com/knative/net-istio/releases/download/${KNATIVE_ISTIO_VERSION}/release.yaml 103 | # Patch the external domain as the default domain svc.cluster.local is not exposed on ingress 104 | kubectl patch cm config-domain --patch '{"data":{"example.com":""}}' -n knative-serving 105 | echo "😀 Successfully installed Knative" 106 | fi 107 |
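# Optional sanity check before moving on: confirm the Istio and Knative pods are running
# kubectl get pods -n istio-system
# kubectl get pods -n knative-serving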
108 | # Install Cert Manager 109 | kubectl apply --validate=false -f https://github.com/jetstack/cert-manager/releases/download/${CERT_MANAGER_VERSION}/cert-manager.yaml 110 | kubectl wait --for=condition=available --timeout=600s deployment/cert-manager-webhook -n cert-manager 111 | cd .. 112 | echo "😀 Successfully installed Cert Manager" 113 | 114 | # Install KServe 115 | KSERVE_CONFIG=kserve.yaml 116 | MAJOR_VERSION=$(echo ${KSERVE_VERSION:1} | cut -d "." -f1) 117 | MINOR_VERSION=$(echo ${KSERVE_VERSION} | cut -d "." -f2) 118 | if [ ${MAJOR_VERSION} -eq 0 ] && [ ${MINOR_VERSION} -le 6 ]; then KSERVE_CONFIG=kfserving.yaml; fi 119 | 120 | # Retry in order to handle the fact that it may take a minute or so for the TLS assets required by the webhook to be provisioned 121 | kubectl apply -f https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/${KSERVE_CONFIG} 122 | 123 | # Install KServe built-in servingruntimes 124 | kubectl wait --for=condition=ready pod -l control-plane=kserve-controller-manager -n kserve --timeout=300s 125 | kubectl apply -f https://github.com/kserve/kserve/releases/download/${KSERVE_VERSION}/kserve-runtimes.yaml 126 | echo "😀 Successfully installed KServe" 127 | 128 | # Clean up 129 | rm -rf istio-${ISTIO_VERSION} 130 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Citizen Code of Conduct 2 | 3 | ## 1. Purpose 4 | 5 | A primary goal of Machine Learning Engineering is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status, and religion (or lack thereof). 6 | 7 | This code of conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior. 8 | 9 | We invite all those who participate in Machine Learning Engineering to help us create safe and positive experiences for everyone. 10 | 11 | ## 2. Open [Source/Culture/Tech] Citizenship 12 | 13 | A supplemental goal of this Code of Conduct is to increase open [source/culture/tech] citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effects on our community. 14 | 15 | Communities mirror the societies in which they exist and positive action is essential to counteract the many forms of inequality and abuses of power that exist in society. 16 | 17 | If you see someone who is making an extra effort to ensure our community is welcoming, friendly, and encourages all participants to contribute to the fullest extent, we want to know. 18 | 19 | ## 3. Expected Behavior 20 | 21 | The following behaviors are expected and requested of all community members: 22 | 23 | * Participate in an authentic and active way. In doing so, you contribute to the health and longevity of this community. 24 | * Exercise consideration and respect in your speech and actions. 25 | * Attempt collaboration before conflict. 26 | * Refrain from demeaning, discriminatory, or harassing behavior and speech. 27 | * Be mindful of your surroundings and of your fellow participants. Alert community leaders if you notice a dangerous situation, someone in distress, or violations of this Code of Conduct, even if they seem inconsequential.
28 | * Remember that community event venues may be shared with members of the public; please be respectful to all patrons of these locations. 29 | 30 | ## 4. Unacceptable Behavior 31 | 32 | The following behaviors are considered harassment and are unacceptable within our community: 33 | 34 | * Violence, threats of violence or violent language directed against another person. 35 | * Sexist, racist, homophobic, transphobic, ableist or otherwise discriminatory jokes and language. 36 | * Posting or displaying sexually explicit or violent material. 37 | * Posting or threatening to post other people's personally identifying information ("doxing"). 38 | * Personal insults, particularly those related to gender, sexual orientation, race, religion, or disability. 39 | * Inappropriate photography or recording. 40 | * Inappropriate physical contact. You should have someone's consent before touching them. 41 | * Unwelcome sexual attention. This includes sexualized comments or jokes; inappropriate touching, groping, and unwelcome sexual advances. 42 | * Deliberate intimidation, stalking or following (online or in person). 43 | * Advocating for, or encouraging, any of the above behavior. 44 | * Sustained disruption of community events, including talks and presentations. 45 | 46 | ## 5. Weapons Policy 47 | 48 | No weapons will be allowed at Machine Learning Engineering events, community spaces, or in other spaces covered by the scope of this Code of Conduct. Weapons include but are not limited to guns, explosives (including fireworks), and large knives such as those used for hunting or display, as well as any other item used for the purpose of causing injury or harm to others. Anyone seen in possession of one of these items will be asked to leave immediately, and will only be allowed to return without the weapon. Community members are further expected to comply with all state and local laws on this matter. 49 | 50 | ## 6. Consequences of Unacceptable Behavior 51 | 52 | Unacceptable behavior from any community member, including sponsors and those with decision-making authority, will not be tolerated. 53 | 54 | Anyone asked to stop unacceptable behavior is expected to comply immediately. 55 | 56 | If a community member engages in unacceptable behavior, the community organizers may take any action they deem appropriate, up to and including a temporary ban or permanent expulsion from the community without warning (and without refund in the case of a paid event). 57 | 58 | ## 7. Reporting Guidelines 59 | 60 | If you are subject to or witness unacceptable behavior, or have any other concerns, please notify a community organizer as soon as possible. Oscar - ozkary@gmail.com. 61 | 62 | 63 | 64 | Additionally, community organizers are available to help community members engage with local law enforcement or to otherwise help those experiencing unacceptable behavior feel safe. In the context of in-person events, organizers will also provide escorts as desired by the person experiencing distress. 65 | 66 | ## 8. Addressing Grievances 67 | 68 | If you feel you have been falsely or unfairly accused of violating this Code of Conduct, you should notify a community organizer with a concise description of your grievance. Your grievance will be handled in accordance with our existing governing policies. 69 | 70 | 71 | 72 | ## 9.
Scope 73 | 74 | We expect all community participants (contributors, paid or otherwise; sponsors; and other guests) to abide by this Code of Conduct in all community venues--online and in-person--as well as in all one-on-one communications pertaining to community business. 75 | 76 | This code of conduct and its related procedures also applies to unacceptable behavior occurring outside the scope of community activities when such behavior has the potential to adversely affect the safety and well-being of community members. 77 | 78 | ## 10. Contact info 79 | 80 | Oscar - ozkary@gmail.com 81 | 82 | ## 11. License and attribution 83 | 84 | The Citizen Code of Conduct is distributed by [Stumptown Syndicate](http://stumptownsyndicate.org) under a [Creative Commons Attribution-ShareAlike license](http://creativecommons.org/licenses/by-sa/3.0/). 85 | 86 | Portions of text derived from the [Django Code of Conduct](https://www.djangoproject.com/conduct/) and the [Geek Feminism Anti-Harassment Policy](http://geekfeminism.wikia.com/wiki/Conference_anti-harassment/Policy). 87 | 88 | _Revision 2.3. Posted 6 March 2017._ 89 | 90 | _Revision 2.2. Posted 4 February 2016._ 91 | 92 | _Revision 2.1. Posted 23 June 2014._ 93 | 94 | _Revision 2.0, adopted by the [Stumptown Syndicate](http://stumptownsyndicate.org) board on 10 January 2013. Posted 17 March 2013._ 95 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to the Machine Learning Engineering Repository! Contributions from the community help improve and expand this repository, making it a valuable resource for everyone involved in the field of machine learning. 4 | 5 | ## Ways to Contribute 6 | 7 | There are several ways you can contribute to this repository: 8 | 9 | 1. **Code Contributions**: 10 | - Implement new features, algorithms, or enhancements by creating pull requests. 11 | - Fix bugs or improve existing code to enhance the overall quality of the repository. 12 | - Ensure your code is well-documented and adheres to established coding standards. 13 | 14 | 2. **Documentation**: 15 | - Improve and enhance the existing documentation. 16 | - Create new guides, tutorials, or explanatory articles related to machine learning concepts. 17 | - Ensure that the documentation is clear, comprehensive, and easily understandable. 18 | 19 | 3. **Issue Reporting**: 20 | - Report issues, bugs, or inconsistencies you encounter within the repository. 21 | - Provide detailed information about the problem and steps to reproduce it. 22 | 23 | ## How to Contribute 24 | 25 | To contribute to this repository, follow these steps: 26 | 27 | 1. **Fork this Repository**: 28 | - Click the "Fork" button in the top-right corner of this repository to create a copy in your GitHub account. 29 | 30 | 2. **Clone your Fork**: 31 | - Clone the repository from your GitHub account to your local machine. 32 | 33 | ```bash 34 | git clone https://github.com/your-username/machine-learning-repo.git 35 | cd machine-learning-repo 36 | ``` 37 | 38 | 3. Create a Branch: 39 | - Create a new branch for your contribution. 40 | 41 | ```bash 42 | git checkout -b feature/new-feature 43 | ``` 44 | 45 | 4. Make Changes: 46 | - Make the necessary changes and additions to the codebase or documentation. 47 | 48 | 5. Commit Changes: 49 | - Commit your changes with a descriptive commit message. 
50 | 51 | ```bash 52 | git commit -m "Add new feature: XYZ" 53 | ``` 54 | 55 | 6. Push Changes: 56 | - Push your changes to your forked repository. 57 | 58 | ```bash 59 | git push origin feature/new-feature 60 | ``` 61 | 62 | 7. Create a Pull Request (PR): 63 | - Go to the original repository and click on the "New Pull Request" button. 64 | - Select your branch and provide a detailed description of your changes. 65 | 66 | 8. Code Review: 67 | - Participate in discussions and address any feedback provided during the code review process. 68 | 69 | 9. Merge Pull Request: 70 | - Once approved, your contribution will be merged into the main repository. 71 | 72 | Thank you for your valuable contributions! 73 | 74 | ## Code of Conduct 75 | Please note that by contributing to this project, you agree to abide by our Code of Conduct. Make sure to familiarize yourself with its guidelines before contributing. 76 | 77 | ## License 78 | By contributing to this repository, you agree that your contributions will be licensed under the Apache License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Machine Learning Engineering 2 | 3 | Welcome to the Machine Learning Engineering Repository, a comprehensive collection of resources, code, and insights to guide you through the exciting world of machine learning. This repository is designed to provide valuable information, best practices, and hands-on examples for individuals keen on mastering the art and science of machine learning. 4 | 5 | ![Machine Learning Engineering](./images/machine-learning-engineering.jpg "Machine Learning Engineering") 6 | 7 | ## Contents 8 | 9 | - [Machine Learning Engineering](#machine-learning-engineering) 10 | - [Contents](#contents) 11 | - [Overview](#overview) 12 | - [Key Features](#key-features) 13 | - [Use Cases and Selecting Models](#use-cases-and-selecting-models) 14 | - [Getting Started](#getting-started) 15 | - [Contributing](#contributing) 16 | - [License](#license) 17 | 18 | ## Overview 19 | 20 | Machine learning is transforming the way we approach complex problems and make data-driven decisions. This repository serves as a hub for both beginners and seasoned ML engineers, offering a wealth of knowledge encompassing: 21 | 22 | - Fundamentals of machine learning 23 | - Various ML algorithms and techniques 24 | - Data preprocessing and feature engineering 25 | - Model evaluation and selection 26 | - Deployment and scaling strategies 27 | 28 | Whether you're just starting out or looking to expand your ML horizons, you'll find valuable content and practical code examples here. 29 | 30 | ## Key Features 31 | 32 | - **Comprehensive Guides**: Detailed tutorials and guides covering different aspects of the machine learning lifecycle. 33 | - **Hands-on Examples**: Practical code implementations to demonstrate the concepts discussed in the guides. 34 | - **Resource Recommendations**: Curated list of books, online courses, and research papers to deepen your understanding of machine learning. 35 | 36 | ### Use Cases and Selecting Models 37 | 38 | The following shows how models can be used for certain use cases; a minimal code sketch for the first use case appears at the end of this README. 39 | 40 | 1. 
**Logistic Regression:** 41 | - **Use Case:** Predicting Health Risk (e.g., likelihood of a disease based on health indicators) 42 | - **Explanation:** Logistic Regression is appropriate when predicting the probability of a binary outcome, like the likelihood of a person having a specific health condition. 43 | 44 | 2. **Ridge Regression:** 45 | - **Use Case:** Vehicle Price Prediction 46 | - **Explanation:** Ridge Regression can help predict vehicle prices considering features like year, engine HP, and mileage, especially when there might be multicollinearity between features. 47 | 48 | 3. **Decision Trees:** 49 | - **Use Case:** Predicting Drug Interactions 50 | - **Explanation:** Decision Trees can help understand how different factors (e.g., drug dosage, patient age) contribute to potential drug interactions. 51 | 52 | 4. **Random Forests:** 53 | - **Use Case:** Identifying Defects in Vehicle Parts Over Time 54 | - **Explanation:** Random Forests can be effective in analyzing historical data on vehicle parts to predict defect occurrence based on various features and time. 55 | 56 | 5. **Support Vector Machines (SVM):** 57 | - **Use Case:** Manufacturing Equipment Output Quality Control 58 | - **Explanation:** SVM can classify equipment output as per quality standards based on multiple features, ensuring manufacturing consistency. 59 | 60 | 6. **K-Nearest Neighbors (KNN):** 61 | - **Use Case:** Predicting Manufacturing Equipment Maintenance Schedule 62 | - **Explanation:** KNN can use data on equipment behavior and performance to predict when maintenance is needed, based on similar historical cases. 63 | 64 | In summary, each model is suitable for different scenarios based on the nature of the problem and the type of data available. It's essential to understand your problem deeply, consider the available data, and experiment with different models to see what works best for your specific use case. 65 | 66 | 67 | ## Getting Started 68 | 69 | To begin your journey through the world of machine learning, head over to the [Getting Started](./getting_started.md) guide. This guide will walk you through setting up your development environment, understanding the basics, and running your first machine learning model. 70 | 71 | ## Contributing 72 | 73 | We welcome contributions from the community to enhance and expand this repository. If you have ideas, improvements, or new content to add, please review our [Contribution Guidelines](./CONTRIBUTING.md). 74 | 75 | ## License 76 | 77 | This repository is licensed under the [Apache License](./LICENSE), allowing you to use the code and resources within this repository in your projects. 78 | 79 | Happy learning and building with machine learning! 
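To make the first use case in the list above concrete, here is a minimal sketch of a binary health-risk classifier built with scikit-learn; the feature names, records, and target values are invented purely for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy patient records; 'smoker' is categorical, the rest are numeric (invented data)
records = [
    {"age": 54, "smoker": "yes", "bmi": 31.2},
    {"age": 36, "smoker": "no", "bmi": 24.1},
    {"age": 61, "smoker": "yes", "bmi": 28.7},
    {"age": 29, "smoker": "no", "bmi": 22.5},
]
# Binary target: 1 = at risk, 0 = not at risk
y = [1, 0, 1, 0]

# One-hot encode the categorical feature and keep numeric features as-is
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Probability of the positive (at-risk) class for a new patient
patient = {"age": 48, "smoker": "yes", "bmi": 30.0}
print(model.predict_proba(dv.transform([patient]))[0, 1])
```

The same DictVectorizer-plus-model pattern is what the deployment chapters in this repository pickle into `dv.bin`/`model1.bin` and serve behind Flask.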
80 | 81 | -------------------------------------------------------------------------------- /getting_started.md: -------------------------------------------------------------------------------- 1 | # TODO -------------------------------------------------------------------------------- /images/machine-learning-engineering.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/images/machine-learning-engineering.jpg -------------------------------------------------------------------------------- /images/ozkary-machine-learning-classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/images/ozkary-machine-learning-classification.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/.gitignore: -------------------------------------------------------------------------------- 1 | *_input.csv 2 | *PCA50.csv 3 | cnn.py 4 | ml.py 5 | *.csv.gz 6 | discovery.ipynb 7 | 8 | # allow interaction_types.csv to be pushed 9 | !interaction_types.csv 10 | !drugbank_slim_df.csv 11 | !test_cases.csv 12 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use the base image 2 | FROM svizor/zoomcamp-model:3.10.12-slim 3 | 4 | # Set the working directory 5 | WORKDIR /app 6 | 7 | # Copy the Pipenv files to the container 8 | COPY Pipfile Pipfile.lock /app/ 9 | 10 | # Install pipenv and dependencies 11 | RUN pip install pipenv 12 | RUN pipenv install --system --deploy 13 | 14 | # copy the bin files 15 | COPY models/ /app/models/ 16 | 17 | # copy the data files 18 | COPY data/interaction_types.csv /app/data/ 19 | COPY data/drugbank_pca50.csv.gz /app/data/ 20 | 21 | # Copy the DDI library 22 | COPY ddi_lib/ /app/ddi_lib/ 23 | 24 | # Copy the Flask script to the container 25 | COPY app.py /app/ 26 | 27 | # Expose the port the Flask app runs on 28 | EXPOSE 8000 29 | 30 | # Run the Flask app with Gunicorn 31 | CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000", "--workers", "4"] -------------------------------------------------------------------------------- /projects/drug-drug-interaction/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | flask = "*" 8 | gunicorn = "*" 9 | scikit-learn = "*" 10 | rdkit = "*" 11 | pandas = "*" 12 | numpy = "*" 13 | matplotlib = "*" 14 | seaborn = "*" 15 | xgboost = "*" 16 | azure-functions = "*" 17 | 18 | [dev-packages] 19 | 20 | [requires] 21 | python_version = "3.8" 22 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/app.py: -------------------------------------------------------------------------------- 1 | # import libraries for web API support 2 | from flask import Flask, request, jsonify 3 | import sklearn 4 | import json 5 | # import the data prediction module 6 | from ddi_lib_rel import predict 7 | 8 | # create a Flask app instance 9 | app = Flask(__name__) 10 | 11 | VERSION = '1.0.0' 12 | LABEL = 'Drug to Drug Interaction Prediction API' 13 | 14 | # 
define the root endpoint 15 | @app.route('/', methods=['GET']) 16 | # define the root endpoint function 17 | def root(): 18 | return f'{LABEL} {VERSION}' 19 | 20 | print(f"Loading {LABEL} {VERSION}") 21 | print("sklearn version", sklearn.__version__) 22 | 23 | # define the predict endpoint 24 | @app.route('/predict', methods=['POST']) 25 | # define the predict endpoint function 26 | def predict_endpoint(): 27 | print("Predict endpoint called") 28 | # get the request body 29 | data = request.get_json() 30 | 31 | # Parse the JSON string into a dictionary 32 | data_dict = json.loads(data) 33 | 34 | # get the prediction 35 | results = predict(data_dict) 36 | 37 | # return the prediction 38 | return jsonify(results) 39 | 40 | # load the application 41 | if __name__ == '__main__': 42 | app.run(debug=True, host='0.0.0.0', port=8000) 43 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/data/interaction_types.csv: -------------------------------------------------------------------------------- 1 | Interaction type,Description,Subject,DDI type 2 | 67,#Drug1 can cause a decrease in the absorption of #Drug2 resulting in a reduced serum concentration and potentially a decrease in efficacy.,2,DDI type 1 3 | 18,#Drug1 can cause an increase in the absorption of #Drug2 resulting in an increased serum concentration and potentially a worsening of adverse effects.,1,DDI type 2 4 | 13,The absorption of #Drug2 can be decreased when combined with #Drug1.,1,DDI type 3 5 | 3,The bioavailability of #Drug2 can be decreased when combined with #Drug1.,1,DDI type 4 6 | 62,The bioavailability of #Drug2 can be increased when combined with #Drug1.,1,DDI type 5 7 | 47,The metabolism of #Drug2 can be decreased when combined with #Drug1.,2,DDI type 6 8 | 4,The metabolism of #Drug2 can be increased when combined with #Drug1.,1,DDI type 7 9 | 43,The protein binding of #Drug2 can be decreased when combined with #Drug1.,2,DDI type 8 10 | 75,The serum concentration of #Drug2 can be decreased when it is combined with #Drug1.,1,DDI type 9 11 | 73,The serum concentration of #Drug2 can be increased when it is combined with #Drug1.,2,DDI type 10 12 | 77,The serum concentration of the active metabolites of #Drug2 can be increased when #Drug2 is used in combination with #Drug1.,1,DDI type 11 13 | 11,The serum concentration of the active metabolites of #Drug2 can be reduced when #Drug2 is used in combination with #Drug1 resulting in a loss in efficacy.,1,DDI type 12 14 | 70,The therapeutic efficacy of #Drug2 can be decreased when used in combination with #Drug1.,1,DDI type 13 15 | 8,The therapeutic efficacy of #Drug2 can be increased when used in combination with #Drug1.,1,DDI type 14 16 | 72,#Drug1 may decrease the excretion rate of #Drug2 which could result in a higher serum level.,2,DDI type 15 17 | 65,#Drug1 may increase the excretion rate of #Drug2 which could result in a lower serum level and potentially a reduction in efficacy.,2,DDI type 16 18 | 58,#Drug1 may decrease the cardiotoxic activities of #Drug2.,2,DDI type 17 19 | 15,#Drug1 may increase the cardiotoxic activities of #Drug2.,1,DDI type 18 20 | 44,#Drug1 may increase the central neurotoxic activities of #Drug2.,1,DDI type 19 21 | 80,#Drug1 may increase the hepatotoxic activities of #Drug2.,2,DDI type 20 22 | 57,#Drug1 may increase the nephrotoxic activities of #Drug2.,2,DDI type 21 23 | 35,#Drug1 may increase the neurotoxic activities of #Drug2.,1,DDI type 22 24 | 7,#Drug1 may increase the ototoxic 
activities of #Drug2.,2,DDI type 23 25 | 45,#Drug1 may decrease effectiveness of #Drug2 as a diagnostic agent.,2,DDI type 24 26 | 86,The risk of a hypersensitivity reaction to #Drug2 is increased when it is combined with #Drug1.,2,DDI type 25 27 | 49,The risk or severity of adverse effects can be increased when #Drug1 is combined with #Drug2.,3,DDI type 26 28 | 66,The risk or severity of bleeding can be increased when #Drug1 is combined with #Drug2.,3,DDI type 27 29 | 50,The risk or severity of heart failure can be increased when #Drug2 is combined with #Drug1.,3,DDI type 28 30 | 42,The risk or severity of hyperkalemia can be increased when #Drug1 is combined with #Drug2.,3,DDI type 29 31 | 31,The risk or severity of hypertension can be increased when #Drug2 is combined with #Drug1.,3,DDI type 30 32 | 56,The risk or severity of hypotension can be increased when #Drug1 is combined with #Drug2.,3,DDI type 31 33 | 33,The risk or severity of QTc prolongation can be increased when #Drug1 is combined with #Drug2.,3,DDI type 32 34 | 52,#Drug1 may decrease the analgesic activities of #Drug2.,2,DDI type 33 35 | 12,#Drug1 may decrease the anticoagulant activities of #Drug2.,2,DDI type 34 36 | 37,#Drug1 may decrease the antihypertensive activities of #Drug2.,2,DDI type 35 37 | 26,#Drug1 may decrease the antiplatelet activities of #Drug2.,1,DDI type 36 38 | 14,#Drug1 may decrease the bronchodilatory activities of #Drug2.,1,DDI type 37 39 | 29,#Drug1 may decrease the diuretic activities of #Drug2.,1,DDI type 38 40 | 17,#Drug1 may decrease the neuromuscular blocking activities of #Drug2.,1,DDI type 39 41 | 76,#Drug1 may decrease the sedative activities of #Drug2.,1,DDI type 40 42 | 61,#Drug1 may decrease the stimulatory activities of #Drug2.,2,DDI type 41 43 | 5,#Drug1 may decrease the vasoconstricting activities of #Drug2.,1,DDI type 42 44 | 22,#Drug1 may increase the adverse neuromuscular activities of #Drug2.,2,DDI type 43 45 | 69,#Drug1 may increase the analgesic activities of #Drug2.,2,DDI type 44 46 | 2,#Drug1 may increase the anticholinergic activities of #Drug2.,2,DDI type 45 47 | 6,#Drug1 may increase the anticoagulant activities of #Drug2.,2,DDI type 46 48 | 10,#Drug1 may increase the antihypertensive activities of #Drug2.,1,DDI type 47 49 | 53,#Drug1 may increase the antiplatelet activities of #Drug2.,2,DDI type 48 50 | 36,#Drug1 may increase the antipsychotic activities of #Drug2.,2,DDI type 49 51 | 82,#Drug1 may increase the arrhythmogenic activities of #Drug2.,2,DDI type 50 52 | 25,#Drug1 may increase the atrioventricular blocking (AV block) activities of #Drug2.,1,DDI type 51 53 | 54,#Drug1 may increase the bradycardic activities of #Drug2.,1,DDI type 52 54 | 46,#Drug1 may increase the bronchoconstrictory activities of #Drug2.,1,DDI type 53 55 | 16,#Drug1 may increase the central nervous system depressant (CNS depressant) activities of #Drug2.,2,DDI type 54 56 | 79,#Drug1 may increase the central nervous system depressant (CNS depressant) and hypertensive activities of #Drug2.,1,DDI type 55 57 | 39,#Drug1 may increase the constipating activities of #Drug2.,1,DDI type 56 58 | 28,#Drug1 may increase the dermatologic adverse activities of #Drug2.,2,DDI type 57 59 | 74,#Drug1 may increase the fluid retaining activities of #Drug2.,1,DDI type 58 60 | 51,#Drug1 may increase the hypercalcemic activities of #Drug2.,2,DDI type 59 61 | 78,#Drug1 may increase the hyperglycemic activities of #Drug2.,2,DDI type 60 62 | 68,#Drug1 may increase the hyperkalemic activities of #Drug2.,2,DDI type 61 63 | 
71,#Drug1 may increase the hypertensive activities of #Drug2.,2,DDI type 62 64 | 24,#Drug1 may increase the hypocalcemic activities of #Drug2.,2,DDI type 63 65 | 9,#Drug1 may increase the hypoglycemic activities of #Drug2.,2,DDI type 64 66 | 83,#Drug1 may increase the hypokalemic activities of #Drug2.,1,DDI type 65 67 | 55,#Drug1 may increase the hyponatremic activities of #Drug2.,2,DDI type 66 68 | 60,#Drug1 may increase the hypotensive activities of #Drug2.,1,DDI type 67 69 | 41,#Drug1 may increase the hypotensive and central nervous system depressant (CNS depressant) activities of #Drug2.,1,DDI type 68 70 | 34,#Drug1 may increase the immunosuppressive activities of #Drug2.,1,DDI type 69 71 | 63,#Drug1 may increase the myelosuppressive activities of #Drug2.,2,DDI type 70 72 | 48,#Drug1 may increase the myopathic rhabdomyolysis activities of #Drug2.,2,DDI type 71 73 | 27,#Drug1 may increase the neuroexcitatory activities of #Drug2.,1,DDI type 72 74 | 21,#Drug1 may increase the neuromuscular blocking activities of #Drug2.,2,DDI type 73 75 | 30,#Drug1 may increase the orthostatic hypotensive activities of #Drug2.,1,DDI type 74 76 | 1,#Drug1 may increase the photosensitizing activities of #Drug2.,1,DDI type 75 77 | 20,#Drug1 may increase the QTc-prolonging activities of #Drug2.,2,DDI type 76 78 | 40,#Drug1 may increase the respiratory depressant activities of #Drug2.,2,DDI type 77 79 | 32,#Drug1 may increase the sedative activities of #Drug2.,1,DDI type 78 80 | 64,#Drug1 may increase the serotonergic activities of #Drug2.,2,DDI type 79 81 | 23,#Drug1 may increase the stimulatory activities of #Drug2.,2,DDI type 80 82 | 85,#Drug1 may increase the tachycardic activities of #Drug2.,1,DDI type 81 83 | 81,#Drug1 may increase the thrombogenic activities of #Drug2.,2,DDI type 82 84 | 59,#Drug1 may increase the ulcerogenic activities of #Drug2.,1,DDI type 83 85 | 19,#Drug1 may increase the vasoconstricting activities of #Drug2.,2,DDI type 84 86 | 38,#Drug1 may increase the vasodilatory activities of #Drug2.,2,DDI type 85 87 | 84,#Drug1 may increase the vasopressor activities of #Drug2.,1,DDI type 86 -------------------------------------------------------------------------------- /projects/drug-drug-interaction/data/pharma.7z: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/data/pharma.7z -------------------------------------------------------------------------------- /projects/drug-drug-interaction/data/test_cases.csv: -------------------------------------------------------------------------------- 1 | Prescription,Drug name,Smiles 2 | 839,Ritonavir,CC(C)[C@H](NC(=O)N(C)CC1=CSC(=N1)C(C)C)C(=O)N[C@H](C[C@H](O)[C@H](CC1=CC=CC=C1)NC(=O)OCC1=CN=CS1)CC1=CC=CC=C1 3 | 839,Formoterol,COC1=CC=C(CC(C)NCC(O)C2=CC(NC=O)=C(O)C=C2)C=C1 4 | 1781,Ritonavir,CC(C)[C@H](NC(=O)N(C)CC1=CSC(=N1)C(C)C)C(=O)N[C@H](C[C@H](O)[C@H](CC1=CC=CC=C1)NC(=O)OCC1=CN=CS1)CC1=CC=CC=C1 5 | 1781,Olodaterol,COC1=CC=C(CC(C)(C)NC[C@H](O)C2=C3OCC(=O)NC3=CC(O)=C2)C=C1 6 | 61,Phentermine,CC(C)(N)CC1=CC=CC=C1 7 | 61,Brexpiprazole,O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1 8 | 85,Mirtazapine,CN1CCN2C(C1)C1=CC=CC=C1CC1=C2N=CC=C1 9 | 85,Phenylephrine,CNC[C@H](O)C1=CC(O)=CC=C1 -------------------------------------------------------------------------------- /projects/drug-drug-interaction/data_control.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Drug to Drug Interaction - Control File Processor\n", 8 | "\n", 9 | "> Use this file to create control files from the data source\n", 10 | "\n", 11 | "Control files are used to test the models." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "\n", 23 | "df = pd.read_csv('./data/Neuron_input.csv')" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": null, 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "# group the rows by Y, keep the first row of each group, and take the first 500 groups\n", 33 | "df_group = df.groupby('Y', as_index=False).first().head(500)\n", 34 | "# print(df.head())\n", 35 | "\n", 36 | "# load all the valid drug names from a tab delimited file\n", 37 | "df_drugs = pd.read_csv('./data/Approved_drug_Information.txt', sep='\t', header=None)\n", 38 | "\n", 39 | "# join df (Drug1_ID, Drug2_ID) with df_drugs (0) to get the drug names \n", 40 | "df_drugs_names = df_drugs[[0, 1]]\n", 41 | "df_drugs_names.columns = ['drug_id', 'name']\n", 42 | "\n", 43 | "# join two dataframes df (Drug1_ID, Drug2_ID) with df_drugs_names drug_id to get the drug name\n", 44 | "df_drugs_names = df_drugs_names.set_index('drug_id')\n", 45 | "\n", 46 | "# join the drug names with the drug1_id and drug2_id and rename the columns to avoid conflicts \n", 47 | "df_join = df_group.join(df_drugs_names, on='Drug1_ID', rsuffix='_1')\n", 48 | "df_join = df_join.join(df_drugs_names, on='Drug2_ID', rsuffix='_2')\n", 49 | "df_join.rename(columns={'name': 'name_1'}, inplace=True)\n", 50 | "print(df_join.columns)\n" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "\n", 60 | "# remove all the rows with null and None values\n", 61 | "df_cleaned = df_join.dropna()\n", 62 | "print(df_cleaned.head())\n", 63 | "\n", 64 | "# select all rows with valid name_1 and name_2\n", 65 | "df_cleaned = df_cleaned.loc[(df_cleaned['name_1'].notnull()) & (df_cleaned['name_2'].notnull())]\n", 66 | "print(df_cleaned.head())\n" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "df_unique_set = df_cleaned.drop_duplicates()\n", 76 | "\n", 77 | "# set all columns to lowercase and replace spaces with underscore\n", 78 | "df_unique_set.columns = map(str.lower, df_unique_set.columns)\n", 79 | "\n", 80 | "# rename y column to ddi_type\n", 81 | "df_unique_set.rename(columns={'y': 'ddi_type'}, inplace=True)\n", 82 | "\n", 83 | "print(df_unique_set.head(10))\n", 84 | "\n", 85 | "# get a set of test cases pairing each drug name with its SMILES string\n", 86 | "df_test_cases = df_unique_set[['name_1', 'drug1', 'name_2', 'drug2']]\n", 87 | "df_test_cases.columns = ['drug1', 'smiles_1', 'drug2', 'smiles_2']\n", 88 | "\n", 89 | "# save to csv\n", 90 | "df_test_cases.to_csv('./data/test_cases_complete.csv', index=False)\n" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# calculate the ssp for each row and add a new column to the dataframe\n", 100 | "from ddi_lib import DDIPredictor\n", 101 | "predictor = DDIPredictor('./models/ozkary_ddi_xgboost.pkl.bin', './models/ozkary_ddi_encoder.pkl.bin', './data/')\n", 102 | "\n", 103 | "# the drug1 and drug2 columns hold the SMILES strings\n", 104 | "df_unique_set['ssp'] =
df_unique_set.apply(lambda row: predictor.calculate_ssp(row['drug1'], row['drug2']), axis=1)\n", 105 | "\n", 106 | "# select all the rows with ssp > 0\n", 107 | "df_ssp = df_unique_set.loc[df_unique_set['ssp'] > 0]\n", 108 | "print(df_ssp.head(10))\n", 109 | "\n", 110 | "# save this file to a csv file\n", 111 | "df_ssp.to_csv('./data/control_features.csv', index=False)\n" 112 | ] 113 | } 114 | ], 115 | "metadata": { 116 | "kernelspec": { 117 | "display_name": "Python 3", 118 | "language": "python", 119 | "name": "python3" 120 | }, 121 | "language_info": { 122 | "name": "python", 123 | "version": "3.8.10" 124 | } 125 | }, 126 | "nbformat": 4, 127 | "nbformat_minor": 2 128 | } 129 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib/__init__.py: -------------------------------------------------------------------------------- 1 | # your_package/__init__.py 2 | from .__version__ import __version__ 3 | from .data_train import DDIProcessData, DDIModelFactory 4 | from .data_train_mlp import DDIMLPFactory, DDIProcessor 5 | from .data_predict import DDIPredictor, DDIModelLoader, predict, load_test_cases 6 | 7 | print(f"Initializing drug-drug interaction ddi_lib {__version__}") -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib/__version__.py: -------------------------------------------------------------------------------- 1 | # your_package/__version__.py 2 | 3 | __version__ = "0.1.0" # Replace with your desired version number 4 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib/data_predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # Drug to Drug Interaction - Predicting the interaction 5 | 6 | import os 7 | import numpy as np 8 | import pandas as pd 9 | import sklearn 10 | import pickle 11 | from rdkit import Chem 12 | from rdkit.Chem import AllChem 13 | 14 | 15 | class DDIModelLoader(): 16 | """ 17 | Class to load a model from a pickle file and make predictions 18 | """ 19 | def __init__(self, model_path, encoder_path=None): 20 | self.model_path = model_path 21 | self.encoder_path = encoder_path 22 | self.model = None 23 | self.encoder = None 24 | self.load_model() 25 | 26 | def load_model(self): 27 | """ 28 | Load the model from the pickle file 29 | Load the encoder if encoder_path is set 30 | """ 31 | 32 | with open(self.model_path, 'rb') as model: 33 | self.model = pickle.load(model) 34 | 35 | if self.encoder_path is not None: 36 | with open(self.encoder_path, 'rb') as encoder: 37 | self.encoder = pickle.load(encoder) 38 | 39 | def predict(self, X): 40 | 41 | # Transform the data 42 | # X_encoded = self.encoder.transform(X) 43 | 44 | # Predict the results 45 | y_pred = self.model.predict(X) 46 | return y_pred 47 | 48 | 49 | class DDIPredictor: 50 | """ 51 | Maps the predictions to the original labels and meaning 52 | """ 53 | def __init__(self, model_path, encoder_path, data_path): 54 | self.model = DDIModelLoader(model_path, encoder_path) 55 | self.interactions = None 56 | self.data_path = data_path 57 | self.drug_pca_lookup = None 58 | self.load_drug_pca_lookup() 59 | 60 | def load_drug_pca_lookup(self): 61 | """ 62 | Load the drug pca lookup from a compressed csv file 63 | """ 64 | if
self.drug_pca_lookup is None: 65 | pca_drugs = pd.read_csv(f'{self.data_path}drugbank_pca50.csv.gz', index_col=0) 66 | pca_drugs['name'] = pca_drugs['name'].str.lower() 67 | # convert drugs to a dictionary using name in lowercase as the key 68 | self.drug_pca_lookup = pca_drugs.set_index('name').T.to_dict('list') 69 | 70 | def predict(self, X): 71 | """ 72 | Predict the results and map them to the original labels 73 | """ 74 | return self.model.predict(X) 75 | 76 | def feature_names(self): 77 | """ 78 | Return the feature names 79 | """ 80 | # if the underlying model exposes feature names, return them 81 | if hasattr(self.model.model, 'feature_names'): 82 | return self.model.model.feature_names 83 | 84 | return None 85 | 86 | def build_model_input(self, rxs, features=101): 87 | """ 88 | Build the model input matrix using the pca lookup 89 | """ 90 | # select the mean of the pc_ columns from the dict_drugs 91 | mean_pca = np.mean(list(self.drug_pca_lookup.values()), axis=0).tolist() 92 | 93 | # for each unique drug name in rxs do a lookup to get the pca values 94 | inputs = [] 95 | for rx in rxs: 96 | # from the dict get all the keys with the drug names 97 | pca_drug = [] 98 | for label in ['drug1','drug2']: 99 | name = rx[label].lower() 100 | pca = self.drug_pca_lookup[name] if name in self.drug_pca_lookup else mean_pca 101 | pca_drug = pca_drug + pca 102 | pca_drug = pca_drug + [rx['ssp']] 103 | inputs = inputs + [pca_drug] 104 | 105 | # convert the list to a numpy array to see the shape 106 | X = np.array(inputs) 107 | print(X.shape) 108 | 109 | return X 110 | 111 | def build_model_message(self, ssp_values, features=101): 112 | """ 113 | Build a model input matrix that carries only the ssp values 114 | """ 115 | 116 | X = np.zeros((len(ssp_values), features)) 117 | for index, ssp in enumerate(ssp_values): 118 | X[index,-1] = ssp 119 | 120 | print(X.shape) 121 | return X 122 | 123 | def get_ddi_description(self, results): 124 | # check if the list is empty 125 | 126 | if len(results) == 0: 127 | return None 128 | 129 | if self.interactions is None: 130 | self.load_interactions() 131 | 132 | notes = [] 133 | for result in results: 134 | # add one to the predicted class to match the 1-based ddi_type encoding used during training 135 | ddi_type = int(result['result']) + 1 136 | ddi_row = self.interactions.loc[self.interactions['ddi_type'] == ddi_type] 137 | 138 | # default note when no matching interaction type is found 139 | note = f'No interaction found for {result["drug1"]} and {result["drug2"]}' 140 | if not ddi_row.empty: 141 | note = ddi_row['description'].to_string(index=False).replace('#Drug1', result['drug1']).replace('#Drug2', result['drug2']) 142 | notes.append(note) 143 | 144 | return notes 145 | 146 | def load_interactions(self): 147 | """ 148 | Load the interactions from csv 149 | """ 150 | df = pd.read_csv(f'{self.data_path}interaction_types.csv') 151 | 152 | # rename the columns to lowercase and replace spaces with underscore 153 | df.columns = map(str.lower, df.columns) 154 | df.columns = df.columns.str.replace(' ', '_') 155 | 156 | # remove the DDI type text from the ddi_type column 157 | df['ddi_type'] = df['ddi_type'].str.replace('DDI type ', '') 158 | 159 | # cast the ddi_type column to integer 160 | df['ddi_type'] = df['ddi_type'].astype(int) 161 | self.interactions = df 162 | 163 | def calculate_ssp(self, smiles_drug1, smiles_drug2): 164 | 165 | """ 166 | Structural Similarity Profile (SSP) for drug pairs 167 | """ 168 | 169 | # check that both SMILES strings are present 170 | if smiles_drug1 is None or smiles_drug2 is None: 171 | return 0 172 | 173
| try: 174 | mol_drug1 = Chem.MolFromSmiles(smiles_drug1) 175 | mol_drug2 = Chem.MolFromSmiles(smiles_drug2) 176 | 177 | fp_drug1 = AllChem.GetMorganFingerprintAsBitVect(mol_drug1, 2, nBits=1024) 178 | fp_drug2 = AllChem.GetMorganFingerprintAsBitVect(mol_drug2, 2, nBits=1024) 179 | 180 | array_fp_drug1 = np.array(list(fp_drug1.ToBitString())).astype(int) 181 | array_fp_drug2 = np.array(list(fp_drug2.ToBitString())).astype(int) 182 | 183 | tanimoto_similarity = np.sum(np.logical_and(array_fp_drug1, array_fp_drug2)) / np.sum(np.logical_or(array_fp_drug1, array_fp_drug2)) 184 | 185 | return tanimoto_similarity 186 | except Exception: 187 | return 0 188 | 189 | 190 | def load_test_cases(): 191 | """ 192 | Load the test cases from the csv file 193 | """ 194 | df_test_cases = pd.read_csv('./data/test_cases.csv') 195 | # make all columns lowercase and replace spaces with underscores 196 | df_test_cases.columns = [col.lower().replace(' ', '_') for col in df_test_cases.columns] 197 | 198 | # convert the data into a drug pair using the prescription column 199 | prescriptions = {} 200 | for _, row in df_test_cases.iterrows(): 201 | rx = row['prescription'] 202 | if rx not in prescriptions: 203 | prescriptions[rx] = {} 204 | 205 | # get the key count to start building the properties 206 | key_count = 1 if len(prescriptions[rx]) == 0 else 2 207 | drug = f'drug{key_count}' 208 | smile = f'smiles{key_count}' 209 | prescriptions[rx][drug] = row['drug_name'] 210 | prescriptions[rx][smile] = row['smiles'] 211 | 212 | return prescriptions 213 | 214 | def predict(data): 215 | """ 216 | Predict the DDI for the given data 217 | """ 218 | 219 | # load the model 220 | model_file = './models/ozkary_ddi_xgboost.pkl.bin' 221 | encoder_file = './models/ozkary_ddi_encoder.pkl.bin' 222 | data_path = './data/' 223 | 224 | predictor = DDIPredictor(model_file, encoder_file, data_path) 225 | 226 | prescriptions = data 227 | 228 | # for each drug pair calculate the structural similarity profile 229 | for rx in prescriptions: 230 | drug1 = prescriptions[rx]['smiles1'] 231 | drug2 = prescriptions[rx]['smiles2'] 232 | prescriptions[rx]['ssp'] = predictor.calculate_ssp(drug1, drug2) 233 | 234 | print(prescriptions.values()) 235 | 236 | # select all the ssp values from the list 237 | ssp_values = [item['ssp'] for item in prescriptions.values()] 238 | # X = predictor.build_model_message(ssp_values) 239 | X = predictor.build_model_input(prescriptions.values()) 240 | 241 | # run a prediction 242 | y_pred = predictor.predict(X) 243 | print('Predictions ', y_pred) 244 | # for each y_pred value add it to the dictionary 245 | for index, item in enumerate(prescriptions.values()): 246 | item['result'] = y_pred[index] 247 | 248 | # print the results 249 | print(prescriptions.values()) 250 | 251 | # load the interaction types file 252 | result = predictor.get_ddi_description(prescriptions.values()) 253 | 254 | return result 255 | 256 | 257 | # add a main function for the entry point to the program 258 | if __name__ == '__main__': 259 | os.system('clear') 260 | print('Running DDI main function') 261 | 262 | prescriptions = load_test_cases() 263 | results = predict(prescriptions) 264 | for result in results: 265 | print(result) 266 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib_rel/__init__.py: -------------------------------------------------------------------------------- 1 | # your_package/__init__.py 2 | from .__version__ import __version__ 3 | from .data_predict
import DDIPredictor, DDIModelLoader, predict, load_test_cases 4 | 5 | print(f"Initializing drug-drug interaction ddi_lib {__version__}") -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib_rel/__version__.py: -------------------------------------------------------------------------------- 1 | # your_package/__version__.py 2 | 3 | __version__ = "0.1.0" # Replace with your desired version number 4 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ddi_lib_rel/data_predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # Drug to Drug Interaction - Predicting the interaction 5 | 6 | import os 7 | import numpy as np 8 | import pandas as pd 9 | import sklearn 10 | import pickle 11 | from rdkit import Chem 12 | from rdkit.Chem import AllChem 13 | 14 | 15 | class DDIModelLoader(): 16 | """ 17 | Class to load a model from a pickle file and make predictions 18 | """ 19 | def __init__(self, model_path, encoder_path=None): 20 | self.model_path = model_path 21 | self.encoder_path = encoder_path 22 | self.model = None 23 | self.encoder = None 24 | self.load_model() 25 | 26 | def load_model(self): 27 | """ 28 | Load the model from the pickle file 29 | Load the encoder if encoder_path is set 30 | """ 31 | 32 | with open(self.model_path, 'rb') as model: 33 | self.model = pickle.load(model) 34 | 35 | if self.encoder_path is not None: 36 | with open(self.encoder_path, 'rb') as encoder: 37 | self.encoder = pickle.load(encoder) 38 | 39 | def predict(self, X): 40 | 41 | # Transform the data 42 | # X_encoded = self.encoder.transform(X) 43 | 44 | # Predict the results 45 | y_pred = self.model.predict(X) 46 | return y_pred 47 | 48 | 49 | class DDIPredictor: 50 | """ 51 | Maps the predictions to the original labels and meaning 52 | """ 53 | def __init__(self, model_path, encoder_path, data_path): 54 | self.model = DDIModelLoader(model_path, encoder_path) 55 | self.interactions = None 56 | self.data_path = data_path 57 | self.drug_pca_lookup = None 58 | self.load_drug_pca_lookup() 59 | 60 | def load_drug_pca_lookup(self): 61 | """ 62 | Load the drug pca lookup from a compressed csv file 63 | """ 64 | if self.drug_pca_lookup is None: 65 | pca_drugs = pd.read_csv(f'{self.data_path}drugbank_pca50.csv.gz', index_col=0) 66 | pca_drugs['name'] = pca_drugs['name'].str.lower() 67 | # convert drugs to a dictionary using name in lowercase as the key 68 | self.drug_pca_lookup = pca_drugs.set_index('name').T.to_dict('list') 69 | 70 | def predict(self, X): 71 | """ 72 | Predict the results and map them to the original labels 73 | """ 74 | return self.model.predict(X) 75 | 76 | def feature_names(self): 77 | """ 78 | Return the feature names 79 | """ 80 | # if the underlying model exposes feature names, return them 81 | if hasattr(self.model.model, 'feature_names'): 82 | return self.model.model.feature_names 83 | 84 | return None 85 | 86 | def build_model_input(self, rxs, features=101): 87 | """ 88 | Build the model input matrix using the pca lookup 89 | """ 90 | # select the mean of the pc_ columns from the dict_drugs 91 | mean_pca = np.mean(list(self.drug_pca_lookup.values()), axis=0).tolist() 92 | 93 | # for each unique drug name in rxs do a lookup to get the pca values 94 | inputs = [] 95 | for rx in rxs: 96 | # from the dict get all the keys with the drug names 97 | pca_drug = [] 98 | for label in ['drug1','drug2']: 99 | name =
rx[label].lower() 100 | pca = self.drug_pca_lookup[name] if name in self.drug_pca_lookup else mean_pca 101 | pca_drug = pca_drug + pca 102 | pca_drug = pca_drug + [rx['ssp']] 103 | inputs = inputs + [pca_drug] 104 | 105 | # convert the list to a numpy array to see the shape 106 | X = np.array(inputs) 107 | print(X.shape) 108 | 109 | return X 110 | 111 | def build_model_message(self, ssp_values, features=101): 112 | """ 113 | Build a model input matrix that carries only the ssp values 114 | """ 115 | 116 | X = np.zeros((len(ssp_values), features)) 117 | for index, ssp in enumerate(ssp_values): 118 | X[index,-1] = ssp 119 | 120 | print(X.shape) 121 | return X 122 | 123 | def get_ddi_description(self, results): 124 | # check if the list is empty 125 | 126 | if len(results) == 0: 127 | return None 128 | 129 | if self.interactions is None: 130 | self.load_interactions() 131 | 132 | notes = [] 133 | for result in results: 134 | # add one to the predicted class to match the 1-based ddi_type encoding used during training 135 | ddi_type = int(result['result']) + 1 136 | ddi_row = self.interactions.loc[self.interactions['ddi_type'] == ddi_type] 137 | 138 | # default note when no matching interaction type is found 139 | note = f'No interaction found for {result["drug1"]} and {result["drug2"]}' 140 | if not ddi_row.empty: 141 | note = ddi_row['description'].to_string(index=False).replace('#Drug1', result['drug1']).replace('#Drug2', result['drug2']) 142 | notes.append(note) 143 | 144 | return notes 145 | 146 | def load_interactions(self): 147 | """ 148 | Load the interactions from csv 149 | """ 150 | df = pd.read_csv(f'{self.data_path}interaction_types.csv') 151 | 152 | # rename the columns to lowercase and replace spaces with underscore 153 | df.columns = map(str.lower, df.columns) 154 | df.columns = df.columns.str.replace(' ', '_') 155 | 156 | # remove the DDI type text from the ddi_type column 157 | df['ddi_type'] = df['ddi_type'].str.replace('DDI type ', '') 158 | 159 | # cast the ddi_type column to integer 160 | df['ddi_type'] = df['ddi_type'].astype(int) 161 | self.interactions = df 162 | 163 | def calculate_ssp(self, smiles_drug1, smiles_drug2): 164 | 165 | """ 166 | Structural Similarity Profile (SSP) for drug pairs 167 | """ 168 | 169 | # check that both SMILES strings are present 170 | if smiles_drug1 is None or smiles_drug2 is None: 171 | return 0 172 | 173 | try: 174 | mol_drug1 = Chem.MolFromSmiles(smiles_drug1) 175 | mol_drug2 = Chem.MolFromSmiles(smiles_drug2) 176 | 177 | fp_drug1 = AllChem.GetMorganFingerprintAsBitVect(mol_drug1, 2, nBits=1024) 178 | fp_drug2 = AllChem.GetMorganFingerprintAsBitVect(mol_drug2, 2, nBits=1024) 179 | 180 | array_fp_drug1 = np.array(list(fp_drug1.ToBitString())).astype(int) 181 | array_fp_drug2 = np.array(list(fp_drug2.ToBitString())).astype(int) 182 | 183 | tanimoto_similarity = np.sum(np.logical_and(array_fp_drug1, array_fp_drug2)) / np.sum(np.logical_or(array_fp_drug1, array_fp_drug2)) 184 | 185 | return tanimoto_similarity 186 | except Exception: 187 | return 0 188 | 189 | 190 | def load_test_cases(): 191 | """ 192 | Load the test cases from the csv file 193 | """ 194 | df_test_cases = pd.read_csv('./data/test_cases.csv') 195 | # make all columns lowercase and replace spaces with underscores 196 | df_test_cases.columns = [col.lower().replace(' ', '_') for col in df_test_cases.columns] 197 | 198 | # convert the data into a drug pair using the prescription column 199 | prescriptions = {} 200 | for _, row in df_test_cases.iterrows(): 201 | rx = row['prescription'] 202 | if rx not in prescriptions: 203 | prescriptions[rx] = {}
204 | 205 | # get the key count to start building the properties 206 | key_count = 1 if len(prescriptions[rx]) == 0 else 2 207 | drug = f'drug{key_count}' 208 | smile = f'smiles{key_count}' 209 | prescriptions[rx][drug] = row['drug_name'] 210 | prescriptions[rx][smile] = row['smiles'] 211 | 212 | return prescriptions 213 | 214 | def predict(data): 215 | """ 216 | Predict the DDI for the given data 217 | """ 218 | 219 | # load the model 220 | model_file = './models/ozkary_ddi_xgboost.pkl.bin' 221 | encoder_file = './models/ozkary_ddi_encoder.pkl.bin' 222 | data_path = './data/' 223 | 224 | predictor = DDIPredictor(model_file, encoder_file, data_path) 225 | 226 | prescriptions = data 227 | 228 | # for each drug pair calculate the structural similarity profile 229 | for rx in prescriptions: 230 | drug1 = prescriptions[rx]['smiles1'] 231 | drug2 = prescriptions[rx]['smiles2'] 232 | prescriptions[rx]['ssp'] = predictor.calculate_ssp(drug1, drug2) 233 | 234 | print(prescriptions.values()) 235 | 236 | # select all the ssp values from the list 237 | ssp_values = [item['ssp'] for item in prescriptions.values()] 238 | # X = predictor.build_model_message(ssp_values) 239 | X = predictor.build_model_input(prescriptions.values()) 240 | 241 | # run a prediction 242 | y_pred = predictor.predict(X) 243 | print('Predictions ', y_pred) 244 | # for each y_pred value add it to the dictionary 245 | for index, item in enumerate(prescriptions.values()): 246 | item['result'] = y_pred[index] 247 | 248 | # print the results 249 | print(prescriptions.values()) 250 | 251 | # load the interaction types file 252 | result = predictor.get_ddi_description(prescriptions.values()) 253 | 254 | return result 255 | 256 | 257 | # add a main function for the entry point to the program 258 | if __name__ == '__main__': 259 | os.system('clear') 260 | print('Running DDI main function') 261 | 262 | prescriptions = load_test_cases() 263 | results = predict(prescriptions) 264 | for result in results: 265 | print(result) 266 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-interaction-type-class-balance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-interaction-type-class-balance.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-interaction-type-distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-interaction-type-distribution.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-ml-ddi-model-confusion-matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-ml-ddi-model-confusion-matrix.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-ml-ddi-model-evaluation.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-ml-ddi-model-evaluation.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-mlp-model3_accuracy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-mlp-model3_accuracy.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-mlp-model3_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-mlp-model3_loss.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-mlp-neural-network1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-mlp-neural-network1.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-mlp-neural-network2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-mlp-neural-network2.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-mlp-neural-network3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-mlp-neural-network3.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-pca-feature-importance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-pca-feature-importance.png -------------------------------------------------------------------------------- /projects/drug-drug-interaction/images/ozkary-predicting-drug-drug-interactions-with-ai.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/images/ozkary-predicting-drug-drug-interactions-with-ai.jpg -------------------------------------------------------------------------------- /projects/drug-drug-interaction/models/ozkary-ddi.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/models/ozkary-ddi.h5 
-------------------------------------------------------------------------------- /projects/drug-drug-interaction/models/ozkary_ddi_encoder.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/models/ozkary_ddi_encoder.pkl.bin -------------------------------------------------------------------------------- /projects/drug-drug-interaction/models/ozkary_ddi_xgboost.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/models/ozkary_ddi_xgboost.pkl.bin -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/.gitignore: -------------------------------------------------------------------------------- 1 | bin 2 | obj 3 | csx 4 | .vs 5 | edge 6 | Publish 7 | 8 | *.user 9 | *.suo 10 | *.cscfg 11 | *.Cache 12 | project.lock.json 13 | 14 | /packages 15 | /TestResults 16 | 17 | /tools/NuGet.exe 18 | /App_Data 19 | /secrets 20 | /data 21 | .secrets 22 | appsettings.json 23 | local.settings.json 24 | 25 | node_modules 26 | dist 27 | 28 | # Local python packages 29 | .python_packages/ 30 | 31 | # Python Environments 32 | .env 33 | .venv 34 | env/ 35 | venv/ 36 | ENV/ 37 | env.bak/ 38 | venv.bak/ 39 | 40 | # Byte-compiled / optimized / DLL files 41 | __pycache__/ 42 | *.py[cod] 43 | *$py.class 44 | 45 | # Azurite artifacts 46 | __blobstorage__ 47 | __queuestorage__ 48 | __azurite_db*__.json -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/.vscode/extensions.json: -------------------------------------------------------------------------------- 1 | { 2 | "recommendations": [ 3 | "ms-azuretools.vscode-azurefunctions" 4 | ] 5 | } -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use the official Azure Functions Python image 2 | FROM mcr.microsoft.com/azure-functions/python:3.0-python3.8 3 | 4 | # Set the working directory to your function app 5 | WORKDIR /home/site/wwwroot 6 | 7 | # Copy the Azure Functions code and configuration to the container 8 | COPY . . 
9 | 10 | # Install dependencies for each function 11 | 12 | # for multiple functions add the folder names separated by space 13 | # RUN for func_folder in predict train validate; do \ 14 | 15 | RUN for func_folder in predict; do \ 16 | cd $func_folder && \ 17 | pip install --upgrade pip && \ 18 | pip install -r requirements.txt && \ 19 | cd ..; \ 20 | done 21 | 22 | # Expose port 80 for the Azure Functions runtime 23 | EXPOSE 80 24 | 25 | # The base image entrypoint starts the Azure Functions host, so no CMD is needed 26 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/azure-deploy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Variables 4 | resourceGroupName="dev-ai-ml-group" 5 | storageAccountName="devaimlstorage" 6 | functionAppName="ozkary-ai-ddi" 7 | location="EastUS2" 8 | 9 | # Create a resource group 10 | az group create --name $resourceGroupName --location $location 11 | 12 | # Create a storage account 13 | az storage account create --name $storageAccountName --resource-group $resourceGroupName --location $location --sku Standard_LRS 14 | 15 | # Create a function app 16 | az functionapp create --name $functionAppName --resource-group $resourceGroupName --consumption-plan-location $location --runtime python --runtime-version 3.8 --storage-account $storageAccountName --os-type Linux --functions-version 3 17 | 18 | # Retrieve the storage account connection string 19 | connectionString=$(az storage account show-connection-string --name $storageAccountName --resource-group $resourceGroupName --output tsv) 20 | 21 | # Configure the function app settings 22 | az functionapp config appsettings set --name $functionAppName --resource-group $resourceGroupName --settings AzureWebJobsStorage="$connectionString" 23 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/getting_started.md: -------------------------------------------------------------------------------- 1 | ## Getting Started with Azure Function 2 | ### Last updated: March 8th 2021 3 | 4 | #### Project Structure 5 | The main project folder can contain the following files: 6 | 7 | * **local.settings.json** - Used to store app settings and connection strings when running locally. This file doesn't get published to Azure. To learn more, see [local.settings.file](https://aka.ms/azure-functions/python/local-settings). 8 | * **requirements.txt** - Contains the list of Python packages the system installs when publishing to Azure. 9 | * **host.json** - Contains global configuration options that affect all functions in a function app. This file does get published to Azure. Not all options are supported when running locally. To learn more, see [host.json](https://aka.ms/azure-functions/python/host.json). 10 | * **.vscode/** - (Optional) Contains stored VSCode configuration. To learn more, see [VSCode setting](https://aka.ms/azure-functions/python/vscode-getting-started). 11 | * **.venv/** - (Optional) Contains a Python virtual environment used by local development. 12 | * **Dockerfile** - (Optional) Used when publishing your project in a [custom container](https://aka.ms/azure-functions/python/custom-container). 13 | * **tests/** - (Optional) Contains the test cases of your function app. For more information, see [Unit Testing](https://aka.ms/azure-functions/python/unit-testing).
14 | * **.funcignore** - (Optional) Declares files that shouldn't get published to Azure. Usually, this file contains .vscode/ to ignore your editor setting, .venv/ to ignore local Python virtual environment, tests/ to ignore test cases, and local.settings.json to prevent local app settings from being published. 15 | 16 | Each function has its own code file and binding configuration file ([**function.json**](https://aka.ms/azure-functions/python/function.json)). 17 | 18 | #### Developing your first Python function using VS Code 19 | 20 | If you have not already, please check out our [quickstart](https://aka.ms/azure-functions/python/quickstart) to get you started with Azure Functions development in Python. 21 | 22 | #### Publishing your function app to Azure 23 | 24 | For more information on deployment options for Azure Functions, please visit this [guide](https://docs.microsoft.com/en-us/azure/azure-functions/create-first-function-vs-code-python#publish-the-project-to-azure). 25 | 26 | #### Next Steps 27 | 28 | * To learn more about developing Azure Functions, please visit [Azure Functions Developer Guide](https://aka.ms/azure-functions/python/developer-guide). 29 | 30 | * To learn more specific guidance on developing Azure Functions with Python, please visit [Azure Functions Developer Python Guide](https://aka.ms/azure-functions/python/python-developer-guide). -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/host.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "2.0", 3 | "logging": { 4 | "applicationInsights": { 5 | "samplingSettings": { 6 | "isEnabled": true, 7 | "excludedTypes": "Request" 8 | } 9 | } 10 | }, 11 | "extensionBundle": { 12 | "id": "Microsoft.Azure.Functions.ExtensionBundle", 13 | "version": "[3.*, 4.0.0)" 14 | } 15 | } -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | flask = "*" 8 | gunicorn = "*" 9 | scikit-learn = "*" 10 | rdkit = "*" 11 | pandas = "*" 12 | numpy = "*" 13 | matplotlib = "*" 14 | seaborn = "*" 15 | xgboost = "*" 16 | azure-functions = "*" 17 | 18 | [dev-packages] 19 | 20 | [requires] 21 | python_version = "3.8" 22 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/__init__.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import os 3 | import azure.functions as func 4 | import sklearn 5 | import json 6 | from .ddi_lib import predict 7 | 8 | def main(req: func.HttpRequest) -> func.HttpResponse: 9 | logging.info('Predict HTTP trigger function processed a request.') 10 | 11 | try: 12 | data = req.get_json() 13 | if data: 14 | path = os.path.abspath(os.path.dirname(__file__)) 15 | # the body arrives as a JSON-encoded string; decode it and make predictions 16 | predictions = predict(json.loads(data), path) 17 | 18 | # Return the predictions as a JSON response 19 | return func.HttpResponse(json.dumps(predictions), mimetype="application/json") 20 | else: 21 | return func.HttpResponse("Invalid input data.", status_code=400) 22 | except Exception as e: 23 | return func.HttpResponse(f"An error occurred: {str(e)}", status_code=500) 24 |
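A quick way to exercise this function: the handler parses the request body with req.get_json() and then calls json.loads() on the result, so the client must post the prescriptions dictionary as a JSON-encoded string. A minimal client sketch, assuming the Functions host is running locally on the default port 7071 (the api/predict route comes from function.json below) and that the requests package is installed; the drug pair is prescription 61 from data/test_cases.csv:

import json
import requests

# prescription 61 from data/test_cases.csv
prescriptions = {
    "61": {
        "drug1": "Phentermine",
        "smiles1": "CC(C)(N)CC1=CC=CC=C1",
        "drug2": "Brexpiprazole",
        "smiles2": "O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1"
    }
}

# json=json.dumps(...) sends a JSON-encoded string, matching the double decode in main()
response = requests.post("http://localhost:7071/api/predict", json=json.dumps(prescriptions))
print(response.json())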
-------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/__version__.py: -------------------------------------------------------------------------------- 1 | # your_package/__version__.py 2 | 3 | __version__ = "0.1.0" # Replace with your desired version number 4 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/data/interaction_types.csv: -------------------------------------------------------------------------------- 1 | Interaction type,Description,Subject,DDI type 2 | 67,#Drug1 can cause a decrease in the absorption of #Drug2 resulting in a reduced serum concentration and potentially a decrease in efficacy.,2,DDI type 1 3 | 18,#Drug1 can cause an increase in the absorption of #Drug2 resulting in an increased serum concentration and potentially a worsening of adverse effects.,1,DDI type 2 4 | 13,The absorption of #Drug2 can be decreased when combined with #Drug1.,1,DDI type 3 5 | 3,The bioavailability of #Drug2 can be decreased when combined with #Drug1.,1,DDI type 4 6 | 62,The bioavailability of #Drug2 can be increased when combined with #Drug1.,1,DDI type 5 7 | 47,The metabolism of #Drug2 can be decreased when combined with #Drug1.,2,DDI type 6 8 | 4,The metabolism of #Drug2 can be increased when combined with #Drug1.,1,DDI type 7 9 | 43,The protein binding of #Drug2 can be decreased when combined with #Drug1.,2,DDI type 8 10 | 75,The serum concentration of #Drug2 can be decreased when it is combined with #Drug1.,1,DDI type 9 11 | 73,The serum concentration of #Drug2 can be increased when it is combined with #Drug1.,2,DDI type 10 12 | 77,The serum concentration of the active metabolites of #Drug2 can be increased when #Drug2 is used in combination with #Drug1.,1,DDI type 11 13 | 11,The serum concentration of the active metabolites of #Drug2 can be reduced when #Drug2 is used in combination with #Drug1 resulting in a loss in efficacy.,1,DDI type 12 14 | 70,The therapeutic efficacy of #Drug2 can be decreased when used in combination with #Drug1.,1,DDI type 13 15 | 8,The therapeutic efficacy of #Drug2 can be increased when used in combination with #Drug1.,1,DDI type 14 16 | 72,#Drug1 may decrease the excretion rate of #Drug2 which could result in a higher serum level.,2,DDI type 15 17 | 65,#Drug1 may increase the excretion rate of #Drug2 which could result in a lower serum level and potentially a reduction in efficacy.,2,DDI type 16 18 | 58,#Drug1 may decrease the cardiotoxic activities of #Drug2.,2,DDI type 17 19 | 15,#Drug1 may increase the cardiotoxic activities of #Drug2.,1,DDI type 18 20 | 44,#Drug1 may increase the central neurotoxic activities of #Drug2.,1,DDI type 19 21 | 80,#Drug1 may increase the hepatotoxic activities of #Drug2.,2,DDI type 20 22 | 57,#Drug1 may increase the nephrotoxic activities of #Drug2.,2,DDI type 21 23 | 35,#Drug1 may increase the neurotoxic activities of #Drug2.,1,DDI type 22 24 | 7,#Drug1 may increase the ototoxic activities of #Drug2.,2,DDI type 23 25 | 45,#Drug1 may decrease effectiveness of #Drug2 as a diagnostic agent.,2,DDI type 24 26 | 86,The risk of a hypersensitivity reaction to #Drug2 is increased when it is combined with #Drug1.,2,DDI type 25 27 | 49,The risk or severity of adverse effects can be increased when #Drug1 is combined with #Drug2.,3,DDI type 26 28 | 66,The risk or severity of bleeding can be increased when #Drug1 is combined with #Drug2.,3,DDI type 27 29 | 50,The risk or 
severity of heart failure can be increased when #Drug2 is combined with #Drug1.,3,DDI type 28 30 | 42,The risk or severity of hyperkalemia can be increased when #Drug1 is combined with #Drug2.,3,DDI type 29 31 | 31,The risk or severity of hypertension can be increased when #Drug2 is combined with #Drug1.,3,DDI type 30 32 | 56,The risk or severity of hypotension can be increased when #Drug1 is combined with #Drug2.,3,DDI type 31 33 | 33,The risk or severity of QTc prolongation can be increased when #Drug1 is combined with #Drug2.,3,DDI type 32 34 | 52,#Drug1 may decrease the analgesic activities of #Drug2.,2,DDI type 33 35 | 12,#Drug1 may decrease the anticoagulant activities of #Drug2.,2,DDI type 34 36 | 37,#Drug1 may decrease the antihypertensive activities of #Drug2.,2,DDI type 35 37 | 26,#Drug1 may decrease the antiplatelet activities of #Drug2.,1,DDI type 36 38 | 14,#Drug1 may decrease the bronchodilatory activities of #Drug2.,1,DDI type 37 39 | 29,#Drug1 may decrease the diuretic activities of #Drug2.,1,DDI type 38 40 | 17,#Drug1 may decrease the neuromuscular blocking activities of #Drug2.,1,DDI type 39 41 | 76,#Drug1 may decrease the sedative activities of #Drug2.,1,DDI type 40 42 | 61,#Drug1 may decrease the stimulatory activities of #Drug2.,2,DDI type 41 43 | 5,#Drug1 may decrease the vasoconstricting activities of #Drug2.,1,DDI type 42 44 | 22,#Drug1 may increase the adverse neuromuscular activities of #Drug2.,2,DDI type 43 45 | 69,#Drug1 may increase the analgesic activities of #Drug2.,2,DDI type 44 46 | 2,#Drug1 may increase the anticholinergic activities of #Drug2.,2,DDI type 45 47 | 6,#Drug1 may increase the anticoagulant activities of #Drug2.,2,DDI type 46 48 | 10,#Drug1 may increase the antihypertensive activities of #Drug2.,1,DDI type 47 49 | 53,#Drug1 may increase the antiplatelet activities of #Drug2.,2,DDI type 48 50 | 36,#Drug1 may increase the antipsychotic activities of #Drug2.,2,DDI type 49 51 | 82,#Drug1 may increase the arrhythmogenic activities of #Drug2.,2,DDI type 50 52 | 25,#Drug1 may increase the atrioventricular blocking (AV block) activities of #Drug2.,1,DDI type 51 53 | 54,#Drug1 may increase the bradycardic activities of #Drug2.,1,DDI type 52 54 | 46,#Drug1 may increase the bronchoconstrictory activities of #Drug2.,1,DDI type 53 55 | 16,#Drug1 may increase the central nervous system depressant (CNS depressant) activities of #Drug2.,2,DDI type 54 56 | 79,#Drug1 may increase the central nervous system depressant (CNS depressant) and hypertensive activities of #Drug2.,1,DDI type 55 57 | 39,#Drug1 may increase the constipating activities of #Drug2.,1,DDI type 56 58 | 28,#Drug1 may increase the dermatologic adverse activities of #Drug2.,2,DDI type 57 59 | 74,#Drug1 may increase the fluid retaining activities of #Drug2.,1,DDI type 58 60 | 51,#Drug1 may increase the hypercalcemic activities of #Drug2.,2,DDI type 59 61 | 78,#Drug1 may increase the hyperglycemic activities of #Drug2.,2,DDI type 60 62 | 68,#Drug1 may increase the hyperkalemic activities of #Drug2.,2,DDI type 61 63 | 71,#Drug1 may increase the hypertensive activities of #Drug2.,2,DDI type 62 64 | 24,#Drug1 may increase the hypocalcemic activities of #Drug2.,2,DDI type 63 65 | 9,#Drug1 may increase the hypoglycemic activities of #Drug2.,2,DDI type 64 66 | 83,#Drug1 may increase the hypokalemic activities of #Drug2.,1,DDI type 65 67 | 55,#Drug1 may increase the hyponatremic activities of #Drug2.,2,DDI type 66 68 | 60,#Drug1 may increase the hypotensive activities of #Drug2.,1,DDI type 67 69 | 
41,#Drug1 may increase the hypotensive and central nervous system depressant (CNS depressant) activities of #Drug2.,1,DDI type 68 70 | 34,#Drug1 may increase the immunosuppressive activities of #Drug2.,1,DDI type 69 71 | 63,#Drug1 may increase the myelosuppressive activities of #Drug2.,2,DDI type 70 72 | 48,#Drug1 may increase the myopathic rhabdomyolysis activities of #Drug2.,2,DDI type 71 73 | 27,#Drug1 may increase the neuroexcitatory activities of #Drug2.,1,DDI type 72 74 | 21,#Drug1 may increase the neuromuscular blocking activities of #Drug2.,2,DDI type 73 75 | 30,#Drug1 may increase the orthostatic hypotensive activities of #Drug2.,1,DDI type 74 76 | 1,#Drug1 may increase the photosensitizing activities of #Drug2.,1,DDI type 75 77 | 20,#Drug1 may increase the QTc-prolonging activities of #Drug2.,2,DDI type 76 78 | 40,#Drug1 may increase the respiratory depressant activities of #Drug2.,2,DDI type 77 79 | 32,#Drug1 may increase the sedative activities of #Drug2.,1,DDI type 78 80 | 64,#Drug1 may increase the serotonergic activities of #Drug2.,2,DDI type 79 81 | 23,#Drug1 may increase the stimulatory activities of #Drug2.,2,DDI type 80 82 | 85,#Drug1 may increase the tachycardic activities of #Drug2.,1,DDI type 81 83 | 81,#Drug1 may increase the thrombogenic activities of #Drug2.,2,DDI type 82 84 | 59,#Drug1 may increase the ulcerogenic activities of #Drug2.,1,DDI type 83 85 | 19,#Drug1 may increase the vasoconstricting activities of #Drug2.,2,DDI type 84 86 | 38,#Drug1 may increase the vasodilatory activities of #Drug2.,2,DDI type 85 87 | 84,#Drug1 may increase the vasopressor activities of #Drug2.,1,DDI type 86 -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/ddi_lib/__init__.py: -------------------------------------------------------------------------------- 1 | # your_package/__init__.py 2 | from .__version__ import __version__ 3 | # from .data_train import DDIProcessData, DDIModelFactory 4 | # from .data_train_mlp import DDIMLPFactory, DDIProcessor 5 | from .data_predict import DDIPredictor, DDIModelLoader, predict, load_test_cases 6 | 7 | print(f"Initializing drug-drug interaction ddi_lib {__version__}") -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/ddi_lib/__version__.py: -------------------------------------------------------------------------------- 1 | # your_package/__version__.py 2 | 3 | __version__ = "0.1.0" # Replace with your desired version number 4 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/ddi_lib/data_predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # Drug to Drug Interaction - Predicting the interaction 5 | 6 | import os 7 | import numpy as np 8 | import pandas as pd 9 | import sklearn 10 | import pickle 11 | from rdkit import Chem 12 | from rdkit.Chem import AllChem 13 | 14 | 15 | class DDIModelLoader(): 16 | """ 17 | Class to load a model from a pickle file and make predictions 18 | """ 19 | def __init__(self, model_path, encoder_path=None): 20 | self.model_path = model_path 21 | self.encoder_path = encoder_path 22 | self.model = None 23 | self.encoder = None 24 | self.load_model() 25 | 26 | def load_model(self): 27 | """ 28 | Load the model from the pickle file 29 | Load encoder if use_encoder is set 30 
| """ 31 | 32 | with open(self.model_path, 'rb') as model: 33 | self.model = pickle.load(model) 34 | 35 | if self.encoder_path is not None: 36 | with open(self.encoder_path, 'rb') as encoder: 37 | self.encoder = pickle.load(encoder) 38 | 39 | def predict(self, X): 40 | 41 | # Transform the data 42 | # X_encoded = self.encoder.transform(X) 43 | 44 | # Predict the results 45 | y_pred = self.model.predict(X) 46 | return y_pred 47 | 48 | 49 | class DDIPredictor: 50 | """ 51 | Maps the predictions to the original labels and meaning 52 | """ 53 | def __init__(self, model_path, encoder_path, data_path): 54 | self.model = DDIModelLoader(model_path, encoder_path) 55 | self.interactions = None 56 | self.data_path = data_path 57 | self.drug_pca_lookup = None 58 | self.load_drug_pca_lookup() 59 | 60 | def load_drug_pca_lookup(self): 61 | """ 62 | Load the drug pca lookup from pickle file 63 | """ 64 | if self.drug_pca_lookup is None: 65 | pca_drugs = pd.read_csv(f'{self.data_path}drugbank_pca50.csv.gz',index_col=0) 66 | pca_drugs['name'] = pca_drugs['name'].str.lower() 67 | # convert drugs to a dictionary using name in lowercase as the key 68 | self.drug_pca_lookup = pca_drugs.set_index('name').T.to_dict('list') 69 | 70 | def predict(self, X): 71 | """ 72 | Predict the results and map them to the original labels 73 | """ 74 | return self.model.predict(X) 75 | 76 | def feature_names(self): 77 | """ 78 | Return the feature names 79 | """ 80 | # if the property feature name exists, return it 81 | if hasattr(self.model.model, 'feature_names'): 82 | return self.model.feature_names 83 | 84 | return None 85 | 86 | def build_model_input(self, rxs, features=101): 87 | """ 88 | Build the message to be returned using a pca lookup 89 | """ 90 | # select the mean of the pc_ columns from the dict_drugs 91 | mean_pca = np.mean(list(self.drug_pca_lookup.values()), axis=0).tolist() 92 | 93 | #for each unique drug name in rxs do a lookup to get the pca values 94 | inputs = [] 95 | for rx in rxs: 96 | # from the dict get all the keys with the drug names 97 | pca_drug = [] 98 | for label in ['drug1','drug2']: 99 | name = rx[label].lower() 100 | pca = self.drug_pca_lookup[name] if name in self.drug_pca_lookup else mean_pca 101 | pca_drug = pca_drug + pca 102 | pca_drug = pca_drug + [rx['ssp']] 103 | inputs = inputs + [pca_drug] 104 | 105 | # convert the list to a numpy array to see the shape 106 | X = np.array(inputs) 107 | print(X.shape) 108 | 109 | return X 110 | 111 | def build_model_message(self, ssp_values, features=101): 112 | """ 113 | Build the message to be returned 114 | """ 115 | 116 | X = np.zeros((len(ssp_values), features)) 117 | for index, ssp in enumerate(ssp_values): 118 | X[index,-1] = ssp 119 | 120 | print(X.shape) 121 | return X 122 | 123 | def get_ddi_description(self, results): 124 | # check if the list is empty 125 | 126 | if len(results) == 0: 127 | return None 128 | 129 | if self.interactions is None: 130 | self.load_interactions() 131 | 132 | notes = [] 133 | for result in results: 134 | # select the row with ddi_type = result 135 | ddi_type = int(result['result']) + 1 136 | ddi_row = self.interactions.loc[self.interactions['ddi_type'] == ddi_type] 137 | 138 | # add one to the result to match the encoding during training 139 | note = f'No interaction found for {result["drug1"]} and {result["drug2"]}' 140 | if ddi_row is not None and len(ddi_row) > 0: 141 | note = ddi_row['description'].to_string(index=False).replace('#Drug1', result['drug1']).replace('#Drug2', result['drug2']) 142 | 
notes.append(note) 143 | 144 | return notes 145 | 146 | def load_interactions(self): 147 | """ 148 | Load the interactions from csv 149 | """ 150 | df = pd.read_csv(f'{self.data_path}interaction_types.csv') 151 | 152 | # rename the columns to lowercase and replace spaces with underscore 153 | df.columns = map(str.lower, df.columns) 154 | df.columns = df.columns.str.replace(' ', '_') 155 | 156 | # remove the DDI type text from the ddi_type column 157 | df['ddi_type'] = df['ddi_type'].str.replace('DDI type ', '') 158 | 159 | # cast the ddi_type column to integer 160 | df['ddi_type'] = df['ddi_type'].astype(int) 161 | self.interactions = df 162 | 163 | def calculate_ssp(self, smiles_drug1, smiles_drug2): 164 | 165 | """ 166 | Structural Similarity Profile (SSP) for drug pairs 167 | """ 168 | 169 | # check that both SMILES strings are present 170 | if smiles_drug1 is None or smiles_drug2 is None: 171 | return 0 172 | 173 | try: 174 | mol_drug1 = Chem.MolFromSmiles(smiles_drug1) 175 | mol_drug2 = Chem.MolFromSmiles(smiles_drug2) 176 | 177 | fp_drug1 = AllChem.GetMorganFingerprintAsBitVect(mol_drug1, 2, nBits=1024) 178 | fp_drug2 = AllChem.GetMorganFingerprintAsBitVect(mol_drug2, 2, nBits=1024) 179 | 180 | array_fp_drug1 = np.array(list(fp_drug1.ToBitString())).astype(int) 181 | array_fp_drug2 = np.array(list(fp_drug2.ToBitString())).astype(int) 182 | 183 | tanimoto_similarity = np.sum(np.logical_and(array_fp_drug1, array_fp_drug2)) / np.sum(np.logical_or(array_fp_drug1, array_fp_drug2)) 184 | 185 | return tanimoto_similarity 186 | except Exception: 187 | return 0 188 | 189 | 190 | def load_test_cases(): 191 | """ 192 | Load the test cases from the csv file 193 | """ 194 | df_test_cases = pd.read_csv('./data/test_cases.csv') 195 | # make all columns lowercase and replace spaces with underscores 196 | df_test_cases.columns = [col.lower().replace(' ', '_') for col in df_test_cases.columns] 197 | 198 | # convert the data into a drug pair using the prescription column 199 | prescriptions = {} 200 | for _, row in df_test_cases.iterrows(): 201 | rx = row['prescription'] 202 | if rx not in prescriptions: 203 | prescriptions[rx] = {} 204 | 205 | # get the key count to start building the properties 206 | key_count = 1 if len(prescriptions[rx]) == 0 else 2 207 | drug = f'drug{key_count}' 208 | smile = f'smiles{key_count}' 209 | prescriptions[rx][drug] = row['drug_name'] 210 | prescriptions[rx][smile] = row['smiles'] 211 | 212 | return prescriptions 213 | 214 | def predict(data, path='./'): 215 | """ 216 | Predict the DDI for the given data 217 | """ 218 | 219 | # load the model 220 | data_path = os.path.join(path, 'data/') 221 | model_path = os.path.join(path, 'models') 222 | print('resources', model_path, data_path) 223 | 224 | model_file = f'{model_path}/ozkary_ddi_xgboost.pkl.bin' 225 | encoder_file = f'{model_path}/ozkary_ddi_encoder.pkl.bin' 226 | 227 | predictor = DDIPredictor(model_file, encoder_file, data_path) 228 | 229 | prescriptions = data 230 | 231 | if not isinstance(prescriptions, dict): 232 | raise Exception(f'Invalid data type.
Expected a dictionary {data}') 233 | 234 | # for each drug pair calculate the structural similarity profile 235 | for rx in prescriptions: 236 | drug1 = prescriptions[rx]['smiles1'] 237 | drug2 = prescriptions[rx]['smiles2'] 238 | prescriptions[rx]['ssp'] = predictor.calculate_ssp(drug1, drug2) 239 | 240 | print(prescriptions.values()) 241 | 242 | # select all the ssp values from the list 243 | ssp_values = [item['ssp'] for item in prescriptions.values()] 244 | # X = predictor.build_model_message(ssp_values) 245 | X = predictor.build_model_input(prescriptions.values()) 246 | 247 | # run a prediction 248 | y_pred = predictor.predict(X) 249 | print('Predictions ', y_pred) 250 | # for each y_pred value add it to the dictionary 251 | for index, item in enumerate(prescriptions.values()): 252 | item['result'] = y_pred[index] 253 | 254 | # print the results 255 | print(prescriptions.values()) 256 | 257 | # load the interactions types file 258 | result = predictor.get_ddi_description(prescriptions.values()) 259 | 260 | return result 261 | 262 | 263 | # add a main function for the entry point to the program 264 | if __name__ == '__main__': 265 | os.system('clear') 266 | print('Running DDI main function') 267 | 268 | prescriptions = load_test_cases() 269 | results = predict(prescriptions) 270 | for result in results: 271 | print(result) 272 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/function.json: -------------------------------------------------------------------------------- 1 | { 2 | "scriptFile": "__init__.py", 3 | "bindings": [ 4 | { 5 | "authLevel": "anonymous", 6 | "type": "httpTrigger", 7 | "direction": "in", 8 | "name": "req", 9 | "route": "predict", 10 | "methods": [ 11 | "get", 12 | "post" 13 | ] 14 | }, 15 | { 16 | "type": "http", 17 | "direction": "out", 18 | "name": "$return" 19 | } 20 | ] 21 | } -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary-ddi.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary-ddi.h5 -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary_ddi_encoder.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary_ddi_encoder.pkl.bin -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary_ddi_xgboost.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/drug-drug-interaction/ozkary-ai-ddi/predict/models/ozkary_ddi_xgboost.pkl.bin -------------------------------------------------------------------------------- /projects/drug-drug-interaction/ozkary-ai-ddi/requirements.txt: -------------------------------------------------------------------------------- 1 | # Do not include azure-functions-worker as it may conflict with the Azure 
Functions platform 2 | azure-functions==1.17.0 3 | blinker==1.7.0; python_version >= '3.8' 4 | click==8.1.7; python_version >= '3.7' 5 | contourpy==1.1.1; python_version >= '3.8' 6 | cycler==0.12.1; python_version >= '3.8' 7 | flask==3.0.0; python_version >= '3.8' 8 | fonttools==4.47.0; python_version >= '3.8' 9 | gunicorn==21.2.0; python_version >= '3.5' 10 | importlib-metadata==7.0.0; python_version < '3.10' 11 | importlib-resources==6.1.1; python_version < '3.10' 12 | itsdangerous==2.1.2; python_version >= '3.7' 13 | jinja2==3.1.2; python_version >= '3.7' 14 | joblib==1.3.2; python_version >= '3.7' 15 | kiwisolver==1.4.5; python_version >= '3.7' 16 | markupsafe==2.1.3; python_version >= '3.7' 17 | matplotlib==3.7.4; python_version >= '3.8' 18 | numpy==1.24.4; python_version >= '3.8' 19 | packaging==23.2; python_version >= '3.7' 20 | pandas==2.0.3; python_version >= '3.8' 21 | pillow==10.1.0; python_version >= '3.8' 22 | pyparsing==3.1.1; python_full_version >= '3.6.8' 23 | python-dateutil==2.8.2; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3' 24 | pytz==2023.3.post1 25 | rdkit==2023.9.2 26 | scikit-learn==1.3.2; python_version >= '3.8' 27 | scipy==1.10.1; python_version < '3.12' and python_version >= '3.8' 28 | seaborn==0.13.0; python_version >= '3.8' 29 | six==1.16.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3' 30 | threadpoolctl==3.2.0; python_version >= '3.8' 31 | tzdata==2023.3; python_version >= '2' 32 | werkzeug==3.0.1; python_version >= '3.8' 33 | xgboost==2.0.2; python_version >= '3.8' 34 | zipp==3.17.0; python_version >= '3.8' 35 | -------------------------------------------------------------------------------- /projects/drug-drug-interaction/requirements.txt: -------------------------------------------------------------------------------- 1 | azure-functions==1.17.0 2 | blinker==1.7.0; python_version >= '3.8' 3 | click==8.1.7; python_version >= '3.7' 4 | contourpy==1.1.1; python_version >= '3.8' 5 | cycler==0.12.1; python_version >= '3.8' 6 | flask==3.0.0; python_version >= '3.8' 7 | fonttools==4.47.0; python_version >= '3.8' 8 | gunicorn==21.2.0; python_version >= '3.5' 9 | importlib-metadata==7.0.0; python_version < '3.10' 10 | importlib-resources==6.1.1; python_version < '3.10' 11 | itsdangerous==2.1.2; python_version >= '3.7' 12 | jinja2==3.1.2; python_version >= '3.7' 13 | joblib==1.3.2; python_version >= '3.7' 14 | kiwisolver==1.4.5; python_version >= '3.7' 15 | markupsafe==2.1.3; python_version >= '3.7' 16 | matplotlib==3.7.4; python_version >= '3.8' 17 | numpy==1.24.4; python_version >= '3.8' 18 | packaging==23.2; python_version >= '3.7' 19 | pandas==2.0.3; python_version >= '3.8' 20 | pillow==10.1.0; python_version >= '3.8' 21 | pyparsing==3.1.1; python_full_version >= '3.6.8' 22 | python-dateutil==2.8.2; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3' 23 | pytz==2023.3.post1 24 | rdkit==2023.9.2 25 | scikit-learn==1.3.2; python_version >= '3.8' 26 | scipy==1.10.1; python_version < '3.12' and python_version >= '3.8' 27 | seaborn==0.13.0; python_version >= '3.8' 28 | six==1.16.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3' 29 | threadpoolctl==3.2.0; python_version >= '3.8' 30 | tzdata==2023.3; python_version >= '2' 31 | werkzeug==3.0.1; python_version >= '3.8' 32 | xgboost==2.0.2; python_version >= '3.8' 33 | zipp==3.17.0; python_version >= '3.8' 34 | -------------------------------------------------------------------------------- 
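The `calculate_ssp` method above derives the Tanimoto coefficient manually from the fingerprint bit strings. RDKit ships a built-in helper that computes the same value, which is handy as a cross-check. A minimal sketch, assuming RDKit is installed; the two SMILES strings are illustrative examples (aspirin and ibuprofen), not values taken from the project data:

```python
# Cross-check of DDIPredictor.calculate_ssp using RDKit's built-in Tanimoto helper.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles1, smiles2):
    mol1 = Chem.MolFromSmiles(smiles1)
    mol2 = Chem.MolFromSmiles(smiles2)
    if mol1 is None or mol2 is None:
        return 0  # mirror calculate_ssp: an invalid SMILES yields no similarity
    # same fingerprint settings as calculate_ssp: Morgan radius 2, 1024 bits
    fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2, nBits=1024)
    fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2, nBits=1024)
    return DataStructs.TanimotoSimilarity(fp1, fp2)

# aspirin vs. ibuprofen (illustrative pair)
print(tanimoto('CC(=O)OC1=CC=CC=C1C(=O)O', 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O'))
```

Because both versions operate on the same bit vectors, `DataStructs.TanimotoSimilarity` should agree with the `np.logical_and` / `np.logical_or` arithmetic in `calculate_ssp`.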
/projects/drug-drug-interaction/serverless_deploy.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Variables
4 | functionAppName="./ozkary-ai-ddi/predict"
5 |
6 | echo "Deploying function app $functionAppName"
7 |
8 | mkdir -p $functionAppName/data/
9 | mkdir -p $functionAppName/ddi_lib/
10 | mkdir -p $functionAppName/models/
11 |
12 | cp -r ./models/* $functionAppName/models/
13 | cp -r ./data/interaction_types.csv $functionAppName/data/
14 | cp -r ./ddi_lib/* $functionAppName/ddi_lib/
15 | cp ./Pipfile* $functionAppName
16 |
17 | echo 'Copy the contents of app.py to the function file __init__.py'
18 |
19 |
20 |
21 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/.gitignore:
--------------------------------------------------------------------------------
1 | !test_cases.csv
--------------------------------------------------------------------------------
/projects/heart-disease-risk/Dockerfile:
--------------------------------------------------------------------------------
1 | # Use the base image
2 | FROM svizor/zoomcamp-model:3.10.12-slim
3 |
4 | # Set the working directory
5 | WORKDIR /app
6 |
7 | # Copy the Pipenv files to the container
8 | COPY Pipfile Pipfile.lock /app/
9 |
10 | # Install pipenv and dependencies
11 | RUN pip install pipenv
12 | RUN pipenv install --system --deploy
13 |
14 | # copy the bin files
15 | COPY bin/ /app/bin/
16 |
17 | # Copy the Flask script to the container
18 | COPY data_predict.py /app/
19 | COPY app.py /app/
20 |
21 | # Expose the port your Flask app runs on
22 | EXPOSE 8000
23 |
24 | # Run the Flask app with Gunicorn
25 | CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000", "--workers", "4"]
26 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/Pipfile:
--------------------------------------------------------------------------------
1 | [[source]]
2 | url = "https://pypi.org/simple"
3 | verify_ssl = true
4 | name = "pypi"
5 |
6 | [packages]
7 | scikit-learn = "==1.2.2"
8 | flask = "*"
9 | gunicorn = "*"
10 | azure-functions = "*"
11 |
12 | [dev-packages]
13 |
14 | [requires]
15 | python_version = "3.8"
16 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/app.py:
--------------------------------------------------------------------------------
1 | # import libraries for web API support
2 | from flask import Flask, request, jsonify
3 | import sklearn
4 | import json
5 | # import the data prediction module
6 | from data_predict import predict, probability_label
7 |
8 | # create a Flask app instance
9 | app = Flask(__name__)
10 |
11 | VERSION = '1.0.0'
12 | LABEL = 'Heart Disease Risk Prediction API'
13 |
14 | # define the root endpoint
15 | @app.route('/', methods=['GET'])
16 | # define the root endpoint function
17 | def root():
18 |     return f'{LABEL} {VERSION}'
19 |
20 | print(f"Loading {LABEL} {VERSION}")
21 | print("sklearn version", sklearn.__version__)
22 |
23 | # define the predict endpoint
24 | @app.route('/predict', methods=['POST'])
25 | # define the predict endpoint function
26 | def predict_endpoint():
27 |     print("Predict endpoint called")
28 |     # get the request body
29 |     data = request.get_json()
30 |
31 |     # Parse the JSON string into a list of records (the client posts a JSON-encoded string)
32 |     data_dict = json.loads(data)
33 |
34 |     # get the prediction
35 |     predictions = predict(data_dict)
36 |
37 |     # create a risk_score based on the prediction
38
| results = [] 39 | 40 | # for each prediction, add the risk_score and risk_label 41 | for score in predictions: 42 | # create a risk_score based on the prediction 43 | risk_score = round(score,4) 44 | risk_label = probability_label(score) 45 | results.append({'risk_score': risk_score, 'risk_label': risk_label}) 46 | 47 | # return the prediction 48 | return jsonify(results) 49 | 50 | # load the application 51 | if __name__ == '__main__': 52 | app.run(debug=True, host='0.0.0.0', port=8000) 53 | 54 | -------------------------------------------------------------------------------- /projects/heart-disease-risk/bin/hd_dictvectorizer.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/bin/hd_dictvectorizer.pkl.bin -------------------------------------------------------------------------------- /projects/heart-disease-risk/bin/hd_xgboost_model.pkl.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/bin/hd_xgboost_model.pkl.bin -------------------------------------------------------------------------------- /projects/heart-disease-risk/data/test_cases.csv: -------------------------------------------------------------------------------- 1 | bmi,smoking,alcoholdrinking,stroke,physicalhealth,mentalhealth,diffwalking,sex,agecategory,race,diabetic,physicalactivity,genhealth,sleeptime,asthma,kidneydisease,skincancer 2 | 40,0,0,0,0,0,1,Male,65-69,White,No,1,Good,10,0,0,0 3 | 34,1,0,0,30,0,1,Male,60-64,White,Yes,0,Poor,15,1,0,0 4 | 28,1,0,0,0,0,0,Female,55-59,White,No,1,Very good,5,0,0,0 5 | -------------------------------------------------------------------------------- /projects/heart-disease-risk/data_predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # Heart Disease Risk Analysis Data - Predicting Heart Disease 5 | 6 | import pickle 7 | 8 | # load the models 9 | model_filename = './bin/hd_xgboost_model.pkl.bin' 10 | dv_filename = './bin/hd_dictvectorizer.pkl.bin' 11 | 12 | # Load the model and dv from the files 13 | with open(model_filename, 'rb') as model_file: 14 | loaded_model = pickle.load(model_file) 15 | 16 | with open(dv_filename, 'rb') as dv_file: 17 | loaded_dv = pickle.load(dv_file) 18 | 19 | def probability_label(probability): 20 | 21 | labels = ['none','low', 'medium', 'high'] 22 | label = 'unknown' 23 | 24 | # return the label based on the probability 25 | if probability < 0.3: 26 | label = labels[0] 27 | elif probability < 0.50: 28 | label = labels[1] 29 | elif probability < 0.75: 30 | label = labels[2] 31 | elif probability >= 0.75: 32 | label = labels[3] 33 | 34 | return label 35 | 36 | def predict(data): 37 | # Transform the data 38 | X = loaded_dv.transform(data) 39 | # Predict the probability 40 | y_pred = loaded_model.predict_proba(X)[:, 1] 41 | 42 | return y_pred 43 | 44 | -------------------------------------------------------------------------------- /projects/heart-disease-risk/data_test_api.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Heart Disease Risk Analysis Data - Test the API" 8 | ] 9 | }, 10 | 
{ 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Make the request to make the prediction\n", 17 | "import requests\n", 18 | "import pandas as pd\n", 19 | "import json\n" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 5, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/html": [ 30 | "
\n", 31 | "\n", 44 | "\n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | "
bmismokingalcoholdrinkingstrokephysicalhealthmentalhealthdiffwalkingsexagecategoryracediabeticphysicalactivitygenhealthsleeptimeasthmakidneydiseaseskincancer
040000001Male65-69WhiteNo1Good10000
1341003001Male60-64WhiteYes0Poor15100
228100000Female55-59WhiteNo1Very good5000
\n", 130 | "
" 131 | ], 132 | "text/plain": [ 133 | " bmi smoking alcoholdrinking stroke physicalhealth mentalhealth \\\n", 134 | "0 40 0 0 0 0 0 \n", 135 | "1 34 1 0 0 30 0 \n", 136 | "2 28 1 0 0 0 0 \n", 137 | "\n", 138 | " diffwalking sex agecategory race diabetic physicalactivity \\\n", 139 | "0 1 Male 65-69 White No 1 \n", 140 | "1 1 Male 60-64 White Yes 0 \n", 141 | "2 0 Female 55-59 White No 1 \n", 142 | "\n", 143 | " genhealth sleeptime asthma kidneydisease skincancer \n", 144 | "0 Good 10 0 0 0 \n", 145 | "1 Poor 15 1 0 0 \n", 146 | "2 Very good 5 0 0 0 " 147 | ] 148 | }, 149 | "execution_count": 5, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "# open the test cases csv file and read it into a pandas dataframe \n", 156 | "df = pd.read_csv('./data/test_cases.csv', sep=',', quotechar='\"')\n", 157 | "df.head()\n" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 19, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "def call_api(url,df):\n", 167 | " # Make the request and display the response \n", 168 | " \n", 169 | " data = df.to_dict(orient='records')\n", 170 | "\n", 171 | " # Convert DataFrame data to JSON string\n", 172 | " payload = json.dumps(data)\n", 173 | "\n", 174 | " response = requests.post(url, json=payload)\n", 175 | "\n", 176 | " # Check the response status code\n", 177 | " if response.status_code == 200:\n", 178 | " # If the response status is 200 (OK), print the JSON response\n", 179 | " json_response = response.json() \n", 180 | " print(f\"Results: {json_response}\") \n", 181 | " \n", 182 | " else:\n", 183 | " # If the response status is not 200, print an error message\n", 184 | " print(\"Error:\", response.status_code, response.text)\n" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 34, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "Results: [{'risk_label': 'none', 'risk_score': 0.1065}, {'risk_label': 'low', 'risk_score': 0.3642}, {'risk_label': 'none', 'risk_score': 0.0504}]\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "# define the API local end-point\n", 202 | "\n", 203 | "url = 'http://0.0.0.0:8000/predict'\n", 204 | "\n", 205 | "call_api(url,df)\n" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "# run the API from Docker container\n", 213 | "- Shutdown the previous API\n", 214 | "- Build the Docker Container\n", 215 | " \n", 216 | "```bash\n", 217 | "docker build -t heart_disease_app .\n", 218 | "```\n", 219 | "\n", 220 | "- Once the image is built, you can run the Docker container using:\n", 221 | "\n", 222 | "```bash\n", 223 | "docker run -p 8000:8000 heart_disease_app\n", 224 | "```\n", 225 | "\n", 226 | "- Repeat the API test cases" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "## Azure Function - Cloud Deployment" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 28, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "Results: [{'risk_score': 0.1065, 'risk_label': 'none'}, {'risk_score': 0.3642, 'risk_label': 'low'}, {'risk_score': 0.0504, 'risk_label': 'none'}]\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "# run the function locally\n", 251 | "url = 'http://localhost:7071/api/predict'\n", 252 | "call_api(url,df)" 253 | ] 254 | }, 255 | 
{
256 |    "cell_type": "code",
257 |    "execution_count": 37,
258 |    "metadata": {},
259 |    "outputs": [
260 |     {
261 |      "name": "stdout",
262 |      "output_type": "stream",
263 |      "text": [
264 |       "Results: [{'risk_score': 0.1065, 'risk_label': 'none'}, {'risk_score': 0.3642, 'risk_label': 'low'}, {'risk_score': 0.0504, 'risk_label': 'none'}]\n"
265 |      ]
266 |     }
267 |    ],
268 |    "source": [
269 |     "# run the function in Azure\n",
270 |     "url = 'https://fn-ai-ml-heart-disease.azurewebsites.net/api/predict'\n",
271 |     "\n",
272 |     "call_api(url,df)"
273 |    ]
274 |   }
275 |  ],
276 |  "metadata": {
277 |   "kernelspec": {
278 |    "display_name": "Python 3",
279 |    "language": "python",
280 |    "name": "python3"
281 |   },
282 |   "language_info": {
283 |    "codemirror_mode": {
284 |     "name": "ipython",
285 |     "version": 3
286 |    },
287 |    "file_extension": ".py",
288 |    "mimetype": "text/x-python",
289 |    "name": "python",
290 |    "nbconvert_exporter": "python",
291 |    "pygments_lexer": "ipython3",
292 |    "version": "3.8.10"
293 |   }
294 |  },
295 |  "nbformat": 4,
296 |  "nbformat_minor": 2
297 | }
298 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/data_train.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | # # Heart Disease Risk Analysis Data - Data Processing
5 |
6 | # ### Process the data
7 | #
8 | # - Load the data/2020/heart_2020_processed.csv
9 | # - Process the features
10 | #   - Set the categorical features names
11 | #   - Set the numeric features names
12 | #   - Set the target variable
13 | # - Split the data
14 | #   - train/validation/test split with 60%/20%/20% distribution.
15 | #   - Random_state 42
16 | #   - Use stratify=y to deal with the class imbalance problem
17 | # - Train the model
18 | #   - LogisticRegression
19 | #   - RandomForestClassifier
20 | #   - XGBClassifier
21 | #   - DecisionTreeClassifier
22 | # - Evaluate the models and compare them
23 | #   - accuracy_score
24 | #   - precision_score
25 | #   - recall_score
26 | #   - f1_score
27 | #   - Confusion Matrix
28 | #
29 |
30 | import pandas as pd
31 | import numpy as np
32 | import matplotlib.pyplot as plt
33 | import seaborn as sns
34 | import pickle
35 |
36 | # Initialize the HeartDiseaseFactory and HeartDiseaseTrainData class
37 | from heart_disease_model_factory import HeartDiseaseTrainData, HeartDiseaseModelFactory
38 |
39 |
40 | # open the csv file and read it into a pandas dataframe to understand the data
41 | df_source = pd.read_csv('./data/2020/heart_2020_processed.csv', sep=',', quotechar='"')
42 |
43 | # save the original set of data
44 | df = df_source.copy()
45 |
46 | df.head()
47 |
48 |
49 | # Process the features
50 |
51 | # set the target feature
52 | target = 'heartdisease'
53 |
54 | train_data = HeartDiseaseTrainData(df, target)
55 | cat_features, num_features = train_data.process_features()
56 |
57 | # split the data in train/val/test sets
58 | # use 60%/20%/20% distribution with seed 42
59 | # use stratified sampling to ensure the distribution of the target feature is the same in all sets
60 | X_train, X_val, y_train, y_val, X_test, y_test = train_data.split_data(test_size=0.2, random_state=42)
61 |
62 | print(X_val.head())
63 |
64 |
65 | # one-hot encode the categorical features for the train data
66 | model_factory = HeartDiseaseModelFactory(cat_features, num_features)
67 | X_train_std = model_factory.preprocess_data(X_train[cat_features + num_features], True)
68 |
69 | # one-hot encode the categorical features for the validation data
70 | X_val_std = model_factory.preprocess_data(X_val[cat_features + num_features], False)
71 |
72 |
73 | # Train the model
74 | model_factory.train(X_train_std, y_train)
75 |
76 |
77 | # Evaluate the model
78 | df_metrics = model_factory.evaluate(X_val_std, y_val)
79 | df_metrics.head()
80 |
81 |
82 | df_metrics[['model', 'accuracy', 'precision', 'recall', 'f1']].head()
83 |
84 |
85 | # plot df_metrics with the model name on the y-axis and metrics on the x-axis for all models and all metrics
86 | # Sort the DataFrame by a metric (e.g., accuracy) to display the best-performing models at the top
87 | df_metrics.sort_values(by='accuracy', ascending=False, inplace=True)
88 | # Define the models, metrics, and corresponding scores
89 | models = df_metrics['model']
90 | metrics = ['accuracy', 'precision', 'recall', 'f1']
91 | scores = df_metrics[['accuracy', 'precision', 'recall', 'f1']]
92 |
93 | # Set the positions for the models
94 | model_positions = np.arange(len(models))
95 |
96 | # Define the width of each bar group
97 | bar_width = 0.15
98 |
99 | # Create a grouped bar chart
100 | plt.figure(figsize=(10, 6))
101 |
102 | for i, metric in enumerate(metrics):
103 |     plt.barh(model_positions + i * bar_width, scores[metric.lower()], bar_width, label=metric)
104 |
105 |     # Add score labels over the bars; use the positional index, since df_metrics keeps its pre-sort index labels after sort_values
106 |     for pos, (_, row) in enumerate(df_metrics.iterrows()):
107 |         score = row[metric.lower()]
108 |         plt.text(score, pos + i * bar_width, f'{score:.3f}', va='top', ha='center', fontsize=9)
109 |
110 | # Customize the chart
111 | plt.yticks(model_positions, models)
112 | plt.xlabel('Score')
113 | plt.ylabel('ML Models')
114 | plt.title('Model Comparison for Heart Disease Prediction')
115 | plt.legend(loc='upper right')
116 |
117 | plt.savefig('./images/ozkary-ml-heart-disease-model-evaluation.png')
118 | # Display the chart
119 | # plt.show()
120 |
121 |
122 |
123 | # ## Confusion Matrix Analysis
124 |
125 |
126 | from sklearn.metrics import confusion_matrix
127 | import seaborn as sns
128 | import matplotlib.pyplot as plt
129 |
130 | cms = []
131 | model_names = []
132 | total_samples = []
133 |
134 | for model_name in df_metrics['model']:
135 |     model_y_pred = df_metrics[df_metrics['model'] == model_name]['y_pred'].iloc[0]
136 |
137 |     # Compute the confusion matrix
138 |     cm = confusion_matrix(y_val, model_y_pred)
139 |     cms.append(cm)
140 |     model_names.append(model_name)
141 |     total_samples.append(np.sum(cm))
142 |
143 | # Create a 2x2 grid of subplots
144 | fig, axes = plt.subplots(2, 2, figsize=(10, 10))
145 |
146 | # Loop through the subplots and plot the confusion matrices
147 | for i, ax in enumerate(axes.flat):
148 |     cm = cms[i]
149 |     im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
150 |     ax.figure.colorbar(im, ax=ax, shrink=0.6)
151 |
152 |     # Set labels, title, and value in the center of the heatmap
153 |     ax.set(xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]),
154 |            xticklabels=["No Heart Disease", "Heart Disease"], yticklabels=["No Heart Disease", "Heart Disease"],
155 |            title=f'{model_names[i]} (n={total_samples[i]})\n')
156 |
157 |     # Loop to annotate each quadrant with its count (r and c avoid shadowing the subplot index i)
158 |     for r in range(cm.shape[0]):
159 |         for c in range(cm.shape[1]):
160 |             ax.text(c, r, str(cm[r, c]), ha="center", va="center", color="gray")
161 |
162 |     ax.title.set_fontsize(12)
163 |     ax.set_xlabel('Predicted', fontsize=10)
164 |     ax.set_ylabel('Actual', fontsize=10)
165 |     ax.xaxis.set_label_position('top')
166 |
167 | # Adjust the layout
168 | plt.tight_layout()
169 |
170 | plt.savefig('./images/ozkary-ml-heart-disease-model-confusion-matrix.png')
171 | # plt.show()
172 |
173 |
174 | # get the metrics grid with total samples for confusion matrix analysis
175 | scores = df_metrics[['model', 'accuracy', 'precision', 'recall', 'f1']].copy()  # copy to avoid a SettingWithCopyWarning on the next line
176 | scores['total'] = total_samples
177 |
178 | scores.head()
179 |
180 | print(cms)
181 |
182 |
183 | # ## Save the model
184 | #
185 | # - Save the best performing model
186 | # - Save the encoder
187 |
188 | # get the best performing model (df_metrics is sorted by accuracy, so row 0 is the best) and the dictionary vectorizer
189 | model = model_factory.models[df_metrics['model'].iloc[0]]
190 | encoder = model_factory.encoder
191 |
192 | # Save the XGBoost model to a file
193 | xgb_model_filename = './bin/hd_xgboost_model.pkl.bin'
194 | with open(xgb_model_filename, 'wb') as model_file:
195 |     pickle.dump(model, model_file)
196 |
197 | # Save the DictVectorizer to a file
198 | dv_filename = './bin/hd_dictvectorizer.pkl.bin'
199 | with open(dv_filename, 'wb') as dv_file:
200 |     pickle.dump(encoder, dv_file)
201 |
202 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/.gitignore:
--------------------------------------------------------------------------------
1 | bin
2 | obj
3 | csx
4 | .vs
5 | edge
6 | Publish
7 |
8 | *.user
9 | *.suo
10 | *.cscfg
11 | *.Cache
12 | project.lock.json
13 |
14 | /packages
15 | /TestResults
16 |
17 | /tools/NuGet.exe
18 | /App_Data
19 | /secrets
20 | /data
21 | .secrets
22 | appsettings.json
23 | local.settings.json
24 |
25 | node_modules
26 | dist
27 |
28 | # Local python packages
29 | .python_packages/
30 |
31 | # Python Environments
32 | .env
33 | .venv
34 | env/
35 | venv/
36 | ENV/
37 | env.bak/
38 | venv.bak/
39 |
40 | # Byte-compiled / optimized / DLL files
41 | __pycache__/
42 | *.py[cod]
43 | *$py.class
44 |
45 | # Azurite artifacts
46 | __blobstorage__
47 | __queuestorage__
48 | __azurite_db*__.json
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/.vscode/extensions.json:
--------------------------------------------------------------------------------
1 | {
2 |   "recommendations": [
3 |     "ms-azuretools.vscode-azurefunctions"
4 |   ]
5 | }
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/Pipfile:
--------------------------------------------------------------------------------
1 | [[source]]
2 | url = "https://pypi.org/simple"
3 | verify_ssl = true
4 | name = "pypi"
5 |
6 | [packages]
7 | scikit-learn = "==1.2.2"
8 | flask = "*"
9 | gunicorn = "*"
10 | azure-functions = "*"
11 |
12 | [dev-packages]
13 |
14 | [requires]
15 | python_version = "3.8"
16 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/azure-deploy.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Variables
4 | resourceGroupName="dev-ai-ml-group"
5 | storageAccountName="devaimlstorage"
6 | functionAppName="fn-ai-ml-heart-disease"
7 | location="EastUS2"
8 |
9 | # Create a resource group
10 | az group create --name $resourceGroupName --location $location
11 |
12 | # Create a storage account
13 | az storage account create --name $storageAccountName --resource-group $resourceGroupName --location $location --sku Standard_LRS
14 |
15 | # Create a function app
16 | az functionapp create --name $functionAppName --resource-group $resourceGroupName --consumption-plan-location $location --runtime python --runtime-version 3.8 --storage-account $storageAccountName --os-type Linux --functions-version 3
17 |
18 | # Retrieve the storage account connection string
19 | connectionString=$(az storage account show-connection-string --name $storageAccountName --resource-group $resourceGroupName --output tsv)
20 |
21 | # Configure the function app settings
22 | az functionapp config appsettings set --name $functionAppName --resource-group $resourceGroupName --settings AzureWebJobsStorage="$connectionString"
23 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/getting_started.md:
--------------------------------------------------------------------------------
1 | ## Getting Started with Azure Function
2 | ### Last updated: March 8th 2021
3 |
4 | #### Project Structure
5 | The main project folder (<project_root>) can contain the following files:
6 |
7 | * **local.settings.json** - Used to store app settings and connection strings when running locally. This file doesn't get published to Azure. To learn more, see [local.settings.file](https://aka.ms/azure-functions/python/local-settings).
8 | * **requirements.txt** - Contains the list of Python packages the system installs when publishing to Azure.
9 | * **host.json** - Contains global configuration options that affect all functions in a function app. This file does get published to Azure. Not all options are supported when running locally. To learn more, see [host.json](https://aka.ms/azure-functions/python/host.json).
10 | * **.vscode/** - (Optional) Contains stored VSCode configuration. To learn more, see [VSCode setting](https://aka.ms/azure-functions/python/vscode-getting-started).
11 | * **.venv/** - (Optional) Contains a Python virtual environment used by local development.
12 | * **Dockerfile** - (Optional) Used when publishing your project in a [custom container](https://aka.ms/azure-functions/python/custom-container).
13 | * **tests/** - (Optional) Contains the test cases of your function app. For more information, see [Unit Testing](https://aka.ms/azure-functions/python/unit-testing).
14 | * **.funcignore** - (Optional) Declares files that shouldn't get published to Azure. Usually, this file contains .vscode/ to ignore your editor settings, .venv/ to ignore the local Python virtual environment, tests/ to ignore test cases, and local.settings.json to prevent local app settings from being published.
15 |
16 | Each function has its own code file and binding configuration file ([**function.json**](https://aka.ms/azure-functions/python/function.json)).
17 |
18 | #### Developing your first Python function using VS Code
19 |
20 | If you have not already, please check out our [quickstart](https://aka.ms/azure-functions/python/quickstart) to get you started with Azure Functions development in Python.
21 |
22 | #### Publishing your function app to Azure
23 |
24 | For more information on deployment options for Azure Functions, please visit this [guide](https://docs.microsoft.com/en-us/azure/azure-functions/create-first-function-vs-code-python#publish-the-project-to-azure).
25 |
26 | #### Next Steps
27 |
28 | * To learn more about developing Azure Functions, please visit [Azure Functions Developer Guide](https://aka.ms/azure-functions/python/developer-guide).
29 |
30 | * To learn more specific guidance on developing Azure Functions with Python, please visit [Azure Functions Developer Python Guide](https://aka.ms/azure-functions/python/python-developer-guide).
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/host.json:
--------------------------------------------------------------------------------
1 | {
2 |   "version": "2.0",
3 |   "logging": {
4 |     "applicationInsights": {
5 |       "samplingSettings": {
6 |         "isEnabled": true,
7 |         "excludedTypes": "Request"
8 |       }
9 |     }
10 |   },
11 |   "extensionBundle": {
12 |     "id": "Microsoft.Azure.Functions.ExtensionBundle",
13 |     "version": "[3.*, 4.0.0)"
14 |   }
15 | }
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/main.tf:
--------------------------------------------------------------------------------
1 | # main.tf
2 |
3 | provider "azurerm" {
4 |   features {}
5 | }
6 |
7 | resource "azurerm_resource_group" "example" {
8 |   name     = "dev-ai-ml-group"
9 |   location = "East US 2"
10 | }
11 |
12 | resource "azurerm_storage_account" "example" {
13 |   name                     = "devaimlstorage"
14 |   resource_group_name      = azurerm_resource_group.example.name
15 |   location                 = azurerm_resource_group.example.location
16 |   account_tier             = "Standard"
17 |   account_replication_type = "LRS"
18 | }
19 |
20 | resource "azurerm_function_app" "example" {
21 |   name                      = "fn-ai-ml-heart-disease"
22 |   location                  = azurerm_resource_group.example.location
23 |   resource_group_name       = azurerm_resource_group.example.name
24 |   app_service_plan_id       = azurerm_app_service_plan.example.id
25 |   storage_connection_string = azurerm_storage_account.example.primary_connection_string
26 |   version                   = "~3"
27 |   os_type                   = "linux"
28 |
29 |   app_settings = {
30 |     AzureWebJobsStorage = azurerm_storage_account.example.primary_connection_string
31 |   }
32 | }
33 |
34 | resource "azurerm_app_service_plan" "example" {
35 |   name                = "example-appserviceplan"
36 |   location            = azurerm_resource_group.example.location
37 |   resource_group_name = azurerm_resource_group.example.name
38 |   kind                = "FunctionApp"
39 |   reserved            = true
40 |
41 |   sku {
42 |     tier = "Dynamic"
43 |     size = "Y1"
44 |   }
45 | }
46 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/__init__.py:
--------------------------------------------------------------------------------
1 | # import libraries for web API support
2 | import azure.functions as func
3 | import sklearn
4 | import json
5 | # import the data prediction module
6 | from .data_predict import predict, probability_label
7 |
8 | VERSION = '1.0.0'
9 | LABEL = 'Heart Disease Risk Prediction API'
10 |
11 | print(f"Loading {LABEL} {VERSION}")
12 | print("sklearn version", sklearn.__version__)
13 |
14 | # define the predict endpoint function
15 | def predict_cases(data):
16 |     print("Predict endpoint called")
17 |
18 |     data_dict = json.loads(data)
19 |
20 |     # get the prediction
21 |     predictions = predict(data_dict)
22 |
23 |     # create a risk_score based on the prediction
24 |     results = []
25 |
26 |     # for each prediction, add the risk_score and risk_label
27 |     for score in predictions:
28 |         # create a risk_score based on the prediction
29 |         risk_score = round(score, 4)
30 |         risk_label = probability_label(score)
31 |         results.append({'risk_score': risk_score, 'risk_label': risk_label})
32 |
33 |     # return the prediction
34 |
return results 35 | 36 | 37 | def main(req: func.HttpRequest) -> func.HttpResponse: 38 | try: 39 | data = req.get_json() 40 | if data: 41 | # Make predictions 42 | predictions = predict_cases(data) 43 | 44 | # Return the predictions as a JSON response 45 | return func.HttpResponse(json.dumps(predictions), mimetype="application/json") 46 | else: 47 | return func.HttpResponse("Invalid input data.", status_code=400) 48 | except Exception as e: 49 | return func.HttpResponse(f"An error occurred: {str(e)}", status_code=500) -------------------------------------------------------------------------------- /projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/data_predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # Heart Disease Risk Analysis Data - Predicting Heart Disease 5 | import os 6 | import pickle 7 | 8 | # Determine the path to the model file relative to the script location and load the models 9 | folder = os.path.dirname(__file__) 10 | model_filename = os.path.join(folder, 'hd_xgboost_model.pkl') 11 | dv_filename = os.path.join(folder, 'hd_dictvectorizer.pkl') 12 | 13 | # Load the model and dv from the files 14 | with open(model_filename, 'rb') as model_file: 15 | loaded_model = pickle.load(model_file) 16 | 17 | with open(dv_filename, 'rb') as dv_file: 18 | loaded_dv = pickle.load(dv_file) 19 | 20 | def probability_label(probability): 21 | 22 | labels = ['none','low', 'medium', 'high'] 23 | label = 'unknown' 24 | 25 | # return the label based on the probability 26 | if probability < 0.3: 27 | label = labels[0] 28 | elif probability < 0.50: 29 | label = labels[1] 30 | elif probability < 0.75: 31 | label = labels[2] 32 | elif probability >= 0.75: 33 | label = labels[3] 34 | 35 | return label 36 | 37 | def predict(data): 38 | # Transform the data 39 | X = loaded_dv.transform(data) 40 | # Predict the probability 41 | y_pred = loaded_model.predict_proba(X)[:, 1] 42 | 43 | return y_pred 44 | 45 | -------------------------------------------------------------------------------- /projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/function.json: -------------------------------------------------------------------------------- 1 | { 2 | "scriptFile": "__init__.py", 3 | "bindings": [ 4 | { 5 | "authLevel": "anonymous", 6 | "type": "httpTrigger", 7 | "direction": "in", 8 | "name": "req", 9 | "route": "predict", 10 | "methods": [ 11 | "get", 12 | "post" 13 | ] 14 | }, 15 | { 16 | "type": "http", 17 | "direction": "out", 18 | "name": "$return" 19 | } 20 | ] 21 | } -------------------------------------------------------------------------------- /projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/hd_dictvectorizer.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/hd_dictvectorizer.pkl -------------------------------------------------------------------------------- /projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/hd_xgboost_model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/fn-ai-ml-heart-disease/predict/hd_xgboost_model.pkl 
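The `probability_label` helper above maps a predicted probability into one of four risk bands: below 0.3 is `none`, below 0.5 is `low`, below 0.75 is `medium`, and 0.75 or above is `high`. A quick sketch that exercises the band edges; it assumes the script runs in the folder where the module and its pickled model binaries live, since the import loads them at module level:

```python
# Exercise the probability_label thresholds from data_predict.py.
from data_predict import probability_label

for p in [0.0, 0.29, 0.3, 0.49, 0.5, 0.74, 0.75, 1.0]:
    print(p, probability_label(p))

# expected: none, none, low, low, medium, medium, high, high
```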
--------------------------------------------------------------------------------
/projects/heart-disease-risk/fn-ai-ml-heart-disease/requirements.txt:
--------------------------------------------------------------------------------
1 | # Do not include azure-functions-worker as it may conflict with the Azure Functions platform
2 |
3 | azure-functions
4 | scikit-learn==1.2.2
5 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/heart_disease_model_base.py:
--------------------------------------------------------------------------------
1 | from abc import ABC, abstractmethod
2 | from sklearn.model_selection import train_test_split
3 |
4 | class HeartDiseaseModelBase(ABC):
5 |     """
6 |     Abstract class for heart disease risk prediction model
7 |     """
8 |
9 |     def __init__(self):
10 |         # Common initialization code
11 |         pass
12 |
13 |     def split_data(self, df, features, test_size=0.2, random_state=42, stratify=None):
14 |         """
15 |         Split the data into training, validation and test sets
16 |         """
17 |         # split the data in train/val/test sets, with a 60%/20%/20% distribution using the given random_state
18 |         X = df[features]
19 |         y = df[self.target_variable]
20 |         X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=stratify)
21 |
22 |         # .25 splits the 80% train into 60% train and 20% val
23 |         X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=0.25, random_state=random_state)
24 |
25 |         X_train = X_train.reset_index(drop=True)
26 |         X_val = X_val.reset_index(drop=True)
27 |         X_test = X_test.reset_index(drop=True)
28 |
29 |         return X_train, X_val, X_test, y_train, y_val, y_test  # return the targets as well so subclasses can train and evaluate
30 |
31 |     @abstractmethod
32 |     def preprocess_data(self, data, is_training=True):
33 |         """
34 |         Preprocess the data and return the standardized features and target variable
35 |         """
36 |         pass
37 |
38 |     @abstractmethod
39 |     def train(self, df_train):
40 |         """
41 |         Train the model on the training data set
42 |         """
43 |         pass
44 |
45 |     @abstractmethod
46 |     def evaluate(self, df_val, threshold=0.5):
47 |         """
48 |         Evaluate the model on the validation data set and return the accuracy, precision, recall, and F1 score
49 |         """
50 |         pass
51 |
52 |     def predict(self, df_val):
53 |         """
54 |         Predict the target variable on the validation data set and return the predictions
55 |         """
56 |         X_val, _ = self.preprocess_data(df_val, is_training=False)
57 |         probs = self.model.predict_proba(X_val)
58 |         return probs
59 |
--------------------------------------------------------------------------------
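The two-stage split used by `split_data` above is easy to misread: holding out 20% for test leaves 80%, and taking 25% of that remainder for validation is what yields the 60%/20%/20% distribution. A small worked example with a hypothetical 1,000-row dataset:

```python
# Worked example of the 60/20/20 split ratios used by split_data (hypothetical n).
n = 1000
test = n * 0.2            # 200 rows held out for test
remaining = n - test      # 800 rows left for train/val
val = remaining * 0.25    # 200 rows -> 20% of the original data
train = remaining - val   # 600 rows -> 60% of the original data
print(train, val, test)   # 600.0 200.0 200.0
```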
/projects/heart-disease-risk/heart_disease_model_factory.py:
--------------------------------------------------------------------------------
1 | from sklearn.preprocessing import StandardScaler, OneHotEncoder
2 | from sklearn.feature_extraction import DictVectorizer
3 | from sklearn.compose import ColumnTransformer
4 | from sklearn.pipeline import Pipeline
5 |
6 | from sklearn.linear_model import LogisticRegression
7 | from sklearn.ensemble import RandomForestClassifier
8 | from xgboost import XGBClassifier
9 | from sklearn.tree import DecisionTreeClassifier
10 |
11 | from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
12 | from sklearn.model_selection import train_test_split
13 | import pandas as pd
14 | import numpy as np
15 |
16 | class HeartDiseaseTrainData():
17 |     """
18 |     Data class for training data
19 |     """
20 |     def __init__(self, df, target_variable):
21 |         self.df = df
22 |         self.target_variable = target_variable
23 |
24 |         # get the numeric and categorical features
25 |         self.numeric_features = None
26 |         self.categorical_features = None
27 |
28 |         # list of all features
29 |         self.all_features = None
30 |
31 |     def process_features(self):
32 |         """
33 |         Process the features
34 |         """
35 |         # get the numeric and categorical features
36 |         self.numeric_features = list(self.df.select_dtypes(include=[np.number]).columns)
37 |         self.categorical_features = list(self.df.select_dtypes(include=['object']).columns)
38 |
39 |         # remove the target feature from the list of numeric features
40 |         if self.target_variable in self.numeric_features:
41 |             self.numeric_features.remove(self.target_variable)
42 |
43 |         print('Categorical features', self.categorical_features)
44 |         print('Numerical features', self.numeric_features)
45 |         print('Target feature', self.target_variable)
46 |
47 |         # create a list of all features
48 |         self.all_features = self.categorical_features + self.numeric_features
49 |
50 |         return self.categorical_features, self.numeric_features
51 |
52 |
53 |     def split_data(self, test_size=0.2, random_state=42):
54 |         """
55 |         Split the data into training and validation sets
56 |         """
57 |         # split the data in train/val/test sets, with a 60%/20%/20% distribution using the given random_state
58 |         X = self.df[self.all_features]
59 |         y = self.df[self.target_variable]
60 |         X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
61 |
62 |         # .25 splits the 80% train into 60% train and 20% val
63 |         X_train, X_val, y_train, y_val = train_test_split(X_full_train, y_full_train, test_size=0.25, random_state=random_state)
64 |
65 |         X_train = X_train.reset_index(drop=True)
66 |         X_val = X_val.reset_index(drop=True)
67 |         y_train = y_train.reset_index(drop=True)
68 |         y_val = y_val.reset_index(drop=True)
69 |         X_test = X_test.reset_index(drop=True)
70 |         y_test = y_test.reset_index(drop=True)
71 |
72 |         # print the shape of all the data splits
73 |         print('X_train shape', X_train.shape)
74 |         print('X_val shape', X_val.shape)
75 |         print('X_test shape', X_test.shape)
76 |         print('y_train shape', y_train.shape)
77 |         print('y_val shape', y_val.shape)
78 |         print('y_test shape', y_test.shape)
79 |
80 |         return X_train, X_val, y_train, y_val, X_test, y_test
81 |
82 |
83 | class HeartDiseaseModelFactory():
84 |     """
85 |     Factory class for heart disease risk prediction model
86 |     """
87 |
88 |     def __init__(self, categorical_features, numeric_features):
89 |         # Initialize the preprocessing transformers
90 |         self.scaler = StandardScaler()
91 |         self.encoder = DictVectorizer(sparse=False)
92 |
93 |         self.numeric_features = numeric_features
94 |         self.categorical_features = categorical_features
95 |
96 |         # Define the preprocessing steps
97 |         numeric_transformer = Pipeline(steps=[('scaler', self.scaler)])
98 |         categorical_transformer = Pipeline(steps=[('encoder', self.encoder)])
99 |         self.preprocessor = ColumnTransformer(
100 |             transformers=[
101 |                 ('num', numeric_transformer, self.numeric_features),
102 |                 ('cat', categorical_transformer, self.categorical_features)
103 |             ])
104 |         self.models = None
105 |         self.model = None
106 |
107 |     def preprocess_data(self, X, is_training=True):
108 |         """
109 |         Preprocess the data for training or validation
110 |         """
111 |         X_dict = X.to_dict(orient='records')
112 |         # X_num = X[self.numeric_features]
113 |
114 |         # processor = self.preprocessor
115 |
116 |         if is_training:
117 |             X_std = self.encoder.fit_transform(X_dict)
118 |             # Fit and transform for training data
119 |             # X_cat_std = processor.fit_transform(X_dict)
120 |             # X_num_std = processor.fit_transform(X_num)
121 |             # X_std = np.concatenate((X_num_std, X_cat_std), axis=1)
122 |         else:
123 |             X_std = self.encoder.transform(X_dict)
124 |             # Only transform for validation data
125 |             # X_cat_std = processor.transform(X_dict)
126 |             # X_num_std = processor.transform(X_num)
127 |             # X_std = np.concatenate((X_num_std, X_cat_std), axis=1)
128 |
129 |         # Return the standardized features and target variable
130 |         return X_std
131 |
132 |     def train(self, X_train, y_train):
133 |
134 |         if self.models is None:
135 |             self.models = {
136 |                 'logistic_regression': LogisticRegression(C=10, max_iter=1000, random_state=42),
137 |                 'random_forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1),
138 |                 'xgboost': XGBClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1),
139 |                 'decision_tree': DecisionTreeClassifier(max_depth=5, random_state=42)
140 |             }
141 |
142 |         for model in self.models.keys():
143 |             print('Training model', model)
144 |             self.models[model].fit(X_train, y_train)
145 |
146 |     def evaluate(self, X_val, y_val, threshold=0.5):
147 |         """
148 |         Evaluate the model on the validation data set and return the predictions
149 |         """
150 |
151 |         # create a dataframe to store the metrics
152 |         df_metrics = pd.DataFrame(columns=['model', 'accuracy', 'precision', 'recall', 'f1', 'y_pred'])
153 |
154 |         # define the metrics to be calculated
155 |         fn_metrics = {'accuracy': accuracy_score, 'precision': precision_score, 'recall': recall_score, 'f1': f1_score}
156 |
157 |         # loop through the models and get their metrics
158 |         for model_name in self.models.keys():
159 |
160 |             model = self.models[model_name]
161 |
162 |             # The first column (y_pred_proba[:, 0]) is for class 0 ("N")
163 |             # The second column (y_pred_proba[:, 1]) is for class 1 ("Y")
164 |             y_pred = model.predict_proba(X_val)[:, 1]
165 |             # get the binary predictions
166 |             y_pred_binary = np.where(y_pred > threshold, 1, 0)
167 |
168 |             # add a new row to the dataframe for each model
169 |             df_metrics.loc[len(df_metrics)] = [model_name, 0, 0, 0, 0, y_pred_binary]
170 |
171 |             # get the row index
172 |             row_index = len(df_metrics) - 1
173 |
174 |             # Evaluate the model metrics
175 |             for metric in fn_metrics.keys():
176 |                 score = fn_metrics[metric](y_val, y_pred_binary)
177 |                 df_metrics.at[row_index, metric] = score
178 |
179 |         return df_metrics
180 |
181 |     def save(self, model_name, path):
182 |         """
183 |         Save the model
184 |         """
185 |         # get the model from the models dictionary (use .get to avoid a KeyError for unknown names)
186 |         model = self.models.get(model_name)
187 |
188 |         if model is None:
189 |             print('Model not found')
190 |             return
191 |
192 |         # save the model with pickle; sklearn and xgboost classifiers have no generic .save() method
193 |         import pickle; pickle.dump(model, open(path, 'wb'))
194 |
195 |
196 |     def predict(self, X_val):
197 |         """
198 |         Predict the target variable on the validation data set and return the predictions
199 |         """
200 |         probs = self.model.predict_proba(X_val)
201 |         return probs
202 |
--------------------------------------------------------------------------------
/projects/heart-disease-risk/heart_disease_random_forest.py:
--------------------------------------------------------------------------------
1 | from heart_disease_model_base import HeartDiseaseModelBase
2 |
3 | class HeartDiseaseRandomForest(HeartDiseaseModelBase):
4 |     def __init__(self, numeric_features, categorical_features):
5 |         # ... (your class implementation)
6 |         raise NotImplementedError
7 |
8 |     def preprocess_data(self, data, is_training=True):
9 |         # ...
(your implementation) 10 | raise NotImplementedError 11 | 12 | def train(self, df_train): 13 | # ... (your implementation) 14 | raise NotImplementedError 15 | 16 | 17 | def evaluate(self, df_val): 18 | # ... (your implementation) 19 | raise NotImplementedError 20 | -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-azure-function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-azure-function.png -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-class-balance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-class-balance.png -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-feature-analysis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-feature-analysis.png -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-feature-importance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-feature-importance.png -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-model-confusion-matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-model-confusion-matrix.png -------------------------------------------------------------------------------- /projects/heart-disease-risk/images/ozkary-ml-heart-disease-model-evaluation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ozkary/machine-learning-engineering/ed19b242fd476d9e8ad6f80385e75241acbc2c49/projects/heart-disease-risk/images/ozkary-ml-heart-disease-model-evaluation.png --------------------------------------------------------------------------------
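For reference, the `HeartDiseaseRandomForest` stub in heart_disease_random_forest.py could be completed along the following lines. This is a hedged sketch rather than the project's implementation: it borrows the DictVectorizer encoding from `HeartDiseaseModelFactory`, loosens the method signatures to take preprocessed arrays (as the factory methods do), and assumes the same train/validation conventions:

```python
# Illustrative completion of the HeartDiseaseRandomForest stub (not the project's code).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

from heart_disease_model_base import HeartDiseaseModelBase

class HeartDiseaseRandomForest(HeartDiseaseModelBase):
    def __init__(self, numeric_features, categorical_features):
        super().__init__()
        self.numeric_features = numeric_features
        self.categorical_features = categorical_features
        self.encoder = DictVectorizer(sparse=False)
        self.model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1)

    def preprocess_data(self, data, is_training=True):
        # fit the encoder on training data only, mirroring HeartDiseaseModelFactory
        records = data.to_dict(orient='records')
        return self.encoder.fit_transform(records) if is_training else self.encoder.transform(records)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def evaluate(self, X_val, y_val, threshold=0.5):
        # binarize the positive-class probability at the given threshold
        y_pred = (self.model.predict_proba(X_val)[:, 1] > threshold).astype(int)
        return accuracy_score(y_val, y_pred)
```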